{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# From tokens to numbers: the document-term matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The bag of words model represents a document based on the frequency of the terms or tokens it contains. Each document becomes a vector with one entry for each token in the vocabulary that reflects the token’s relevance to the document.\n",
    "\n",
    "The document-term matrix is straightforward to compute given the vocabulary. However, it is also a crude simplification because it ignores word order and grammatical relationships. Nonetheless, it often achieves good results in text classification quickly and is thus a very useful starting point.\n",
    "\n",
    "There are several ways to weight a token’s vector entry to capture its relevance to the document. Below, we illustrate how to use sklearn to compute binary flags that indicate presence or absence, counts, and weighted counts that account for differences in term frequencies across all documents, i.e., in the corpus."
   ]
  },
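  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The three weighting schemes named above can be sketched in plain Python on a toy corpus. This is a minimal illustration rather than the notebook's method: the corpus and all variable names are hypothetical, tokenization is a bare whitespace split, and the smoothed inverse document frequency used here is just one common variant of the weighting.\n",
    "\n",
    "```python\n",
    "import math\n",
    "from collections import Counter\n",
    "\n",
    "# Hypothetical toy corpus (not the BBC data)\n",
    "corpus = ['the cat sat', 'the cat sat on the mat', 'dogs chase the cat']\n",
    "\n",
    "tokenized = [doc.split() for doc in corpus]\n",
    "vocab = sorted({t for doc in tokenized for t in doc})\n",
    "\n",
    "# Counts: one row per document, one column per vocabulary term\n",
    "counts = [[Counter(doc)[term] for term in vocab] for doc in tokenized]\n",
    "\n",
    "# Binary flags: 1 if the term occurs in the document, else 0\n",
    "binary = [[int(c > 0) for c in row] for row in counts]\n",
    "\n",
    "# Document frequency: number of documents that contain each term\n",
    "n_docs = len(corpus)\n",
    "df = [sum(row[j] for row in binary) for j in range(len(vocab))]\n",
    "\n",
    "# Smoothed inverse document frequency (assumed variant, see lead-in)\n",
    "idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]\n",
    "\n",
    "# Weighted counts: term count scaled by the term's idf\n",
    "tfidf = [[c * w for c, w in zip(row, idf)] for row in counts]\n",
    "\n",
    "# Word order is discarded: a permuted document has the same bag of words\n",
    "same_bag = Counter('sat cat the'.split()) == Counter(tokenized[0])\n",
    "```\n",
    "\n",
    "Terms that occur in every document, such as `the`, receive the lowest weight, while terms confined to a single document receive the highest."
   ]
  },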
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports & Settings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:16:29.575699Z",
     "start_time": "2020-06-20T17:16:29.574102Z"
    }
   },
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:16:30.204824Z",
     "start_time": "2020-06-20T17:16:29.576888Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "from collections import Counter\n",
    "from pathlib import Path\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from scipy import sparse\n",
    "from scipy.spatial.distance import pdist\n",
    "\n",
    "# Visualization\n",
    "import matplotlib.pyplot as plt\n",
    "from matplotlib.ticker import ScalarFormatter\n",
    "import seaborn as sns\n",
    "from ipywidgets import interact, FloatRangeSlider\n",
    "\n",
    "# spacy for language processing\n",
    "import spacy\n",
    "\n",
    "# sklearn for feature extraction & modeling\n",
    "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:16:30.207621Z",
     "start_time": "2020-06-20T17:16:30.205787Z"
    }
   },
   "outputs": [],
   "source": [
    "sns.set_style('white')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## Load BBC data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:16:30.283184Z",
     "start_time": "2020-06-20T17:16:30.208411Z"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "path = Path('..', 'data', 'bbc')\n",
    "files = sorted(path.glob('**/*.txt'))\n",
    "doc_list = []\n",
    "for file in files:\n",
    "    topic = file.parts[-2]  # parent directory name is the topic label\n",
    "    article = file.read_text(encoding='latin1').split('\\n')\n",
    "    heading = article[0].strip()  # first line is the headline\n",
    "    body = ' '.join([line.strip() for line in article[1:]]).strip()\n",
    "    doc_list.append([topic, heading, body])"
   ]
  },
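  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The parsing convention used in the loading cell above (first line is the heading, the remaining lines form the body) can be checked on an in-memory string without touching the file system. The helper name `parse_article` and the sample text are hypothetical and serve only as a sketch of that convention.\n",
    "\n",
    "```python\n",
    "def parse_article(raw):\n",
    "    # Split raw article text into (heading, body)\n",
    "    lines = raw.split('\\n')\n",
    "    heading = lines[0].strip()\n",
    "    body = ' '.join(line.strip() for line in lines[1:]).strip()\n",
    "    return heading, body\n",
    "\n",
    "\n",
    "# Hypothetical article text with blank lines between paragraphs\n",
    "sample = 'Heading line\\n\\nFirst paragraph.\\n\\nSecond paragraph.\\n'\n",
    "heading, body = parse_article(sample)\n",
    "\n",
    "# Whitespace tokenization of the body, as used for the word counts\n",
    "words = body.split()\n",
    "```\n",
    "\n",
    "Blank lines between paragraphs leave doubled spaces in `body`; these are harmless for the later counts because `str.split()` without arguments collapses runs of whitespace."
   ]
  },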
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "### Convert to DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:16:30.290352Z",
     "start_time": "2020-06-20T17:16:30.284343Z"
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 2225 entries, 0 to 2224\n",
      "Data columns (total 3 columns):\n",
      " #   Column   Non-Null Count  Dtype \n",
      "---  ------   --------------  ----- \n",
      " 0   topic    2225 non-null   object\n",
      " 1   heading  2225 non-null   object\n",
      " 2   body     2225 non-null   object\n",
      "dtypes: object(3)\n",
      "memory usage: 52.3+ KB\n"
     ]
    }
   ],
   "source": [
    "docs = pd.DataFrame(doc_list, columns=['topic', 'heading', 'body'])\n",
    "docs.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Inspect results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:16:30.306729Z",
     "start_time": "2020-06-20T17:16:30.291413Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>topic</th>\n",
       "      <th>heading</th>\n",
       "      <th>body</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1744</th>\n",
       "      <td>sport</td>\n",
       "      <td>Davenport hits out at Wimbledon</td>\n",
       "      <td>World number one Lindsay Davenport has critici...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1919</th>\n",
       "      <td>tech</td>\n",
       "      <td>California sets fines for spyware</td>\n",
       "      <td>The makers of computer programs that secretly ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1937</th>\n",
       "      <td>tech</td>\n",
       "      <td>Games maker fights for survival</td>\n",
       "      <td>One of Britain's largest independent game make...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>965</th>\n",
       "      <td>politics</td>\n",
       "      <td>Opposition grows to house arrests</td>\n",
       "      <td>The Conservatives have expressed \"serious misg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1799</th>\n",
       "      <td>sport</td>\n",
       "      <td>Officials respond in court row</td>\n",
       "      <td>Australian tennis' top official has defended t...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>998</th>\n",
       "      <td>politics</td>\n",
       "      <td>Minimum rate for foster parents</td>\n",
       "      <td>Foster carers are to be guaranteed a minimum a...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2038</th>\n",
       "      <td>tech</td>\n",
       "      <td>Europe backs digital TV lifestyle</td>\n",
       "      <td>How people receive their digital entertainment...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1823</th>\n",
       "      <td>sport</td>\n",
       "      <td>Roddick to face Saulnier in final</td>\n",
       "      <td>Andy Roddick will play Cyril Saulnier in the f...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>57</th>\n",
       "      <td>business</td>\n",
       "      <td>Electrolux to export Europe jobs</td>\n",
       "      <td>Electrolux saw its shares rise 14% on Tuesday ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>345</th>\n",
       "      <td>business</td>\n",
       "      <td>Disaster claims 'less than $10bn'</td>\n",
       "      <td>Insurers have sought to calm fears that they f...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         topic                            heading  \\\n",
       "1744     sport    Davenport hits out at Wimbledon   \n",
       "1919      tech  California sets fines for spyware   \n",
       "1937      tech    Games maker fights for survival   \n",
       "965   politics  Opposition grows to house arrests   \n",
       "1799     sport     Officials respond in court row   \n",
       "998   politics    Minimum rate for foster parents   \n",
       "2038      tech  Europe backs digital TV lifestyle   \n",
       "1823     sport  Roddick to face Saulnier in final   \n",
       "57    business   Electrolux to export Europe jobs   \n",
       "345   business  Disaster claims 'less than $10bn'   \n",
       "\n",
       "                                                   body  \n",
       "1744  World number one Lindsay Davenport has critici...  \n",
       "1919  The makers of computer programs that secretly ...  \n",
       "1937  One of Britain's largest independent game make...  \n",
       "965   The Conservatives have expressed \"serious misg...  \n",
       "1799  Australian tennis' top official has defended t...  \n",
       "998   Foster carers are to be guaranteed a minimum a...  \n",
       "2038  How people receive their digital entertainment...  \n",
       "1823  Andy Roddick will play Cyril Saulnier in the f...  \n",
       "57    Electrolux saw its shares rise 14% on Tuesday ...  \n",
       "345   Insurers have sought to calm fears that they f...  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs.sample(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Data drawn from 5 different categories"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:16:30.361820Z",
     "start_time": "2020-06-20T17:16:30.307628Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style  type=\"text/css\" >\n",
       "</style><table id=\"T_c84297fe_b319_11ea_8c33_6045cb72e6b3\" ><thead>    <tr>        <th class=\"blank level0\" ></th>        <th class=\"col_heading level0 col0\" >count</th>    </tr></thead><tbody>\n",
       "                <tr>\n",
       "                        <th id=\"T_c84297fe_b319_11ea_8c33_6045cb72e6b3level0_row0\" class=\"row_heading level0 row0\" >sport</th>\n",
       "                        <td id=\"T_c84297fe_b319_11ea_8c33_6045cb72e6b3row0_col0\" class=\"data row0 col0\" >22.97%</td>\n",
       "            </tr>\n",
       "            <tr>\n",
       "                        <th id=\"T_c84297fe_b319_11ea_8c33_6045cb72e6b3level0_row1\" class=\"row_heading level0 row1\" >business</th>\n",
       "                        <td id=\"T_c84297fe_b319_11ea_8c33_6045cb72e6b3row1_col0\" class=\"data row1 col0\" >22.92%</td>\n",
       "            </tr>\n",
       "            <tr>\n",
       "                        <th id=\"T_c84297fe_b319_11ea_8c33_6045cb72e6b3level0_row2\" class=\"row_heading level0 row2\" >politics</th>\n",
       "                        <td id=\"T_c84297fe_b319_11ea_8c33_6045cb72e6b3row2_col0\" class=\"data row2 col0\" >18.74%</td>\n",
       "            </tr>\n",
       "            <tr>\n",
       "                        <th id=\"T_c84297fe_b319_11ea_8c33_6045cb72e6b3level0_row3\" class=\"row_heading level0 row3\" >tech</th>\n",
       "                        <td id=\"T_c84297fe_b319_11ea_8c33_6045cb72e6b3row3_col0\" class=\"data row3 col0\" >18.02%</td>\n",
       "            </tr>\n",
       "            <tr>\n",
       "                        <th id=\"T_c84297fe_b319_11ea_8c33_6045cb72e6b3level0_row4\" class=\"row_heading level0 row4\" >entertainment</th>\n",
       "                        <td id=\"T_c84297fe_b319_11ea_8c33_6045cb72e6b3row4_col0\" class=\"data row4 col0\" >17.35%</td>\n",
       "            </tr>\n",
       "    </tbody></table>"
      ],
      "text/plain": [
       "<pandas.io.formats.style.Styler at 0x7f5c41434b50>"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs.topic.value_counts(normalize=True).to_frame('count').style.format({'count': '{:,.2%}'.format})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explore Corpus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Token Count via Counter()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:16:30.417187Z",
     "start_time": "2020-06-20T17:16:30.362799Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total word count: 842,910 | per article: 379\n"
     ]
    }
   ],
   "source": [
    "# word count\n",
    "word_count = docs.body.str.split().str.len().sum()\n",
    "print(f'Total word count: {word_count:,d} | per article: {word_count/len(docs):,.0f}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:16:30.586573Z",
     "start_time": "2020-06-20T17:16:30.417998Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "500 1000 1500 2000 "
     ]
    }
   ],
   "source": [
    "token_count = Counter()\n",
    "for i, doc in enumerate(docs.body.tolist(), 1):\n",
    "    if i % 500 == 0:\n",
    "        print(i, end=' ', flush=True)\n",
    "    token_count.update(doc.split())  # split() already strips surrounding whitespace"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:16:30.636553Z",
     "start_time": "2020-06-20T17:16:30.587488Z"
    }
   },
   "outputs": [],
   "source": [
    "tokens = (pd.DataFrame(token_count.most_common(), columns=['token', 'count'])\n",
    "          .set_index('token')\n",
    "          .squeeze())"
   ]
  },
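  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Whitespace splitting, as used for the token counts above, treats case variants and tokens with trailing punctuation as distinct entries. The sketch below shows the effect on a hypothetical two-document corpus and contrasts it with a simple normalization (lowercasing and stripping surrounding punctuation); the normalization is an illustrative assumption, not a step taken in the cells above.\n",
    "\n",
    "```python\n",
    "import string\n",
    "from collections import Counter\n",
    "\n",
    "# Hypothetical mini-corpus\n",
    "toy_docs = ['The cat sat. The end.', 'the cat']\n",
    "\n",
    "# Raw whitespace tokens: 'The' and 'the' differ, 'sat.' keeps its period\n",
    "raw_counts = Counter()\n",
    "for doc in toy_docs:\n",
    "    raw_counts.update(doc.split())\n",
    "\n",
    "# Normalized tokens: lowercase, surrounding punctuation removed\n",
    "norm_counts = Counter()\n",
    "for doc in toy_docs:\n",
    "    norm_counts.update(t.strip(string.punctuation).lower() for t in doc.split())\n",
    "\n",
    "top_token = norm_counts.most_common(1)\n",
    "```\n",
    "\n",
    "Normalization shrinks the vocabulary from five distinct tokens to four here, which is the kind of reduction a vectorizer's preprocessing options are meant to control."
   ]
  },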
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:16:31.264360Z",
     "start_time": "2020-06-20T17:16:30.637589Z"
    }
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAA+gAAAEYCAYAAADPrtzUAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAgAElEQVR4nOzdeZyN9f//8ecxCzFjS7QwlrGEIkskg0TZYkSWIakkKmtkyPYZ+9JMtiTSN2HGluhDZUtkjSlr2cYu+1hmBrNevz/md87HIMx1rmkuPO63W7db5zrXec37Os65zvV6L6/LYRiGIQAAAAAAkKmyZHYDAAAAAAAACToAAAAAALZAgg4AAAAAgA2QoAMAAAAAYAMk6AAAAAAA2AAJOgAAAAAANkCCDgCwvePHj6tUqVJ6/fXXb3qub9++KlWqlKKjo03F3rFjhwYNGnTL59atW6fatWvrtdde07Vr10zFz2iTJk3SypUrb/ncqFGj9MILLygwMFCBgYHq0aOHJCk5OVnDhw9X/fr19dJLLykiIiLdf3fhwoV64YUX1KFDh5ue27t3r9q1a6emTZuqWbNm2rVr1037dOnSRUOGDLll7GHDhrnaHBgYqOeee06NGzfW5cuX02wPDAxU6dKl9X//93+SpBUrVqhx48YKDAzUG2+8oaNHj94y/ttvv33Hz8vmzZv1yiuv3OltAADAUp6Z3QAAAO5G1qxZdejQIZ04cUJPPPGEJOnKlSv6/fff3Yp74MABnT59+pbPLV26VC1atND777/v1t/ISJs3b1bx4sVv+dwff/yhsLAwVaxYMc32OXPm6PDhw1qyZIni4uLUqlUrlS1bVuXKlbvrv7to0SL17NlTgYGBabZfvXpVHTp00PDhw1WrVi2tXLlSvXv31k8//eTaZ9q0adq6dasaNmx4y9gDBgxw/f/x48fVtm1bjRkzRjlz5tTixYtdz82cOVPLli3T66+/rmvXrumjjz7S4sWLVbhwYX399dcaNmyYpk6delP89evX3/VxAgDwbyJBBwDcEzw8PNSgQQP997//VefOnSVJy5cvV506dfTVV1+59ps7d65mzpypLFmyKF++fBo4cKCKFi2qrVu3atSoUUpJSZEkderUSeXKldOECRMUExOjfv36aeTIka44X375pVatWqWsWbMqJiZG2bNn17Zt23TmzBmVKlVKn3zyiT7//HMtX75cKSkpeuKJJzR48GAVKFBA+/fv14ABA3TlyhUVL15cJ06cUK9evfTEE0+ocePG+uOPPySlJp/XP54/f74iIiKUkpKi3Llza+DAgfL391ffvn3l4+OjvXv36tSpUypVqpRGjx6tRYsWadeuXRozZow8PDz00ksvudqfkJCgP//8U19++aWOHTumIkWKqF+/fnr88ce1cuVKtWzZUp6ensqVK5caNWqk77///qYEPSYmRiEhIdqzZ48cDodq1KihDz/8UGPGjNHOnTt1/PhxXbhwQW+++abrNevXr1ehQoVUq1YtSVKdOnVUsGBB1/ObN2/Wr7/+qtatW+vy5ct3/HcfOHCg3nrrLZUuXTrN9iNHjujzzz/XggUL5OXlpYSEBBmGoZiYGElSXFycsmbNelO8fv36SZLat2+vqVOnKjY2VkOGDNHFixflcDj09ttvq2nTpmles3XrVvXu3dvV2fHzzz/r888/V2JiorJly6bg4GBVqFBBEydO1IkTJ3T27FmdOHFCBQoU0NixY5U/f36Fh4drzpw58vLyUtasWTVkyJB/7FgBADzADAAAbO7YsWPGM888Y+zcudOoX7++a3v79u2NvXv3GiVLljTOnz9vbNiwwahbt65x/vx5wzAM49tvvzUaNGhgpKSkGG+88YaxZMkSwzAM46+//jL+85//uPZ59913b/l3g4ODjS+//NIwDMOYMGGCUa9ePSMxMdEwDMP47rvvjB49ergez5kzx3jn
nXcMwzCMV155xZg3b55hGIaxZcsWo1SpUsamTZtcx3HjcRmGYWzevNlo06aNceXKFcMwDOPXX391HWtwcLDRqlUrIz4+3khISDCaNm1qLFiwwDAMw3j99deNH3/88aa2Hz161HjnnXeMvXv3GikpKca0adOMwMBAIyUlxahXr57xxx9/uPadN2+e8cEHH9wUo0+fPsbQoUONlJQUIz4+3nj77beNL7744rZ/d+rUqUbXrl2Nfv36Ga+++qrRvn17Y9euXYZhGMapU6eMxo0bG6dPnzYmTJhghISE3PJ9d/rll1+Ml19+2UhKSrrpue7duxufffZZmm3fffedUbZsWaN69epGtWrVjMOHD98yrvPzkpiYaNSpU8dYtmyZq301atQwfv/9d2PTpk1Go0aNjI0bNxp169Y1/vrrL8MwDOPQoUPGK6+8YkRHRxuGYRj79u0zqlevbsTFxRkTJkww6tSpY8TExBiGYRidOnUyxo8fbyQlJRlly5Y1Tp8+7WrnnDlzbnvsAIAHEyPoAIB7xlNPPSUPDw/t2rVLDz/8sOLi4lSyZEnX87/++qsaNmyovHnzSpKaNWum4cOH6/jx42rQoIGGDBmin3/+Wc8//7w+/PDDdP/9Z555Rp6eqT+dq1ev1s6dO9W8eXNJUkpKiq5evaro6GgdOHDANQpbuXJllSpV6o6xf/nlFx05ckStW7d2bbt8+bIuXrwoSapRo4a8vb0lSSVLltSlS5duG69QoUKaNm2a63GHDh00efJkHT9+XIZhyOFwuJ4zDENZstxclmbt2rWKiIiQw+GQt7e3WrdurRkzZujdd9/9x7+blJSkNWvW6JtvvlH58uW1cuVKvfvuu1q9erV69eqlfv36KX/+/Hd8PyRpxowZ6tSpkzw8PNJsP3nypNatW6dhw4a5tu3du1efffaZfvjhB/n5+embb75R165dtXjx4jTHer3Dhw8rPj5eL7/8siSpQIECevnll/Xrr7+qatWqOnXqlDp37qygoCA9+eSTklJnCJw5cybNrAGHw+Fa716lShX5+PhIksqUKaNLly7Jw8ND9evXV+vWrfXCCy8oICDANcMAAIDrkaADAO4pTZo00ffff6+8efPetP7ZOX39eoZhKCkpSa1bt1bt2rW1fv16/frrr5o0aVKaddF3I3v27Gn+1jvvvKM2bdpISp1SfunSJWXLlk0Oh0OGYbj29fLykqSbticmJqaJFxgYqI8++sj1+MyZM8qVK5ckKVu2bK59b4xzK3v27NGePXvSTNc2DENeXl567LHHdObMGdf2M2fO6NFHH70pRkpKSprkNiUlRUlJSbf9u/nz55e/v7/Kly8vSapbt64GDBig3bt369ixYxo1apQk6dy5c0pOTlZ8fLyGDx9+U5zo6Ght375dkyZNuum5ZcuW6aWXXnIlwlJqQb+KFSvKz89PktS2bVuNHDlSFy5ccHXY3Cg5Ofmm5N35eZFSl1VMnTpV77//vurXr6/y5csrJSVF1apV07hx41yvOXnypPLnz68VK1b847/TJ598on379mnDhg2aOnWqFi9erPHjx9/2vQQAPHio4g4AuKcEBgbqp59+0g8//HBTle0aNWrohx9+cFXo/vbbb5U7d24VLlxYrVu31l9//aVmzZpp6NChunz5ss6ePSsPD487Jp23EhAQoAULFig2NlaSNH78ePXp00fZs2dXpUqVNHfuXEn/S5QlKWfOnEpMTNSBAwckpRahuz7e0qVLXYlzRESE2rdvf8d2/FP7s2TJouHDh+vYsWOSpPDwcJUqVUqPPvqo6tSpo2+//VZJSUm6fPmyli5dqrp1697yGGfNmiXDMJSQkKB58+bp+eefv217atasqePHj7sqt2/ZskUOh0NlypTRmjVrtHjxYi1evFitW7dWw4YNb5mcS9Lvv/+up59+Ok2niNNvv/2m5557Ls22MmXKaMuWLTp37pwkaeXKlSpYsOAtk3Pne1asWDF5enpq+fLl
kqTTp09r2bJlrmN85JFHVLFiRQUHB6tPnz66evWqqlWrpvXr1ysqKkqStGbNGjVp0uS2Vf6jo6NVq1Yt5c6dW2+++aZ69OihnTt33vZ9BAA8mBhBBwDcUwoUKCB/f3/5+voqd+7caZ6rXr263nzzTbVv314pKSnKmzevvvjiC2XJkkW9e/fWiBEjNG7cODkcDnXp0kUFCxZUcnKyPvvsM3Xp0uWWo7X/pEWLFjp9+rRatmwph8Ohxx57zDU6PGbMGA0YMEDz58/XE088oXz58kmSfH199dFHH6ljx47Kmzev6tev74oXEBCgjh076u2335bD4ZCPj48mTZr0j9OznV588UWFhYUpMTFRr776qmt7yZIlNWDAAL333ntKTk7Wo48+qrCwMElSUFCQjh49qsDAQCUmJqpVq1aqUqXKTbEHDBigYcOGqXHjxkpMTFSNGjVcBfr+ySOPPKLPPvtMISEhunr1qry9vTVx4sRbFmy7nnM0uXv37pJSp587q/Xf6MiRIzc9V61aNXXo0EHt2rWTl5eXcuXKpcmTJ9/y9fXr11e7du00ceJETZ48WcOGDdPEiROVnJysDz74QM8995w2b97s2v/VV1/VsmXLNGrUKIWEhGjIkCH68MMPZRiGPD099fnnnytHjhz/eGx58+bVe++9pzfffFPZsmWTh4dHmun5AAA4OYw7zZEDAABueeWVVzRw4EBVrVo1s5tiW4cPH9aCBQvUu3fvzG4KAACZhinuAAAg0x06dEjt2rXL7GYAAJCpGEEHAAAAAMAGGEEHAAAAAMAGSNABAAAAALCBeyZB79ChQ2Y3AQAAAACADHPPJOgXLlzI7CYAAAAAAJBh7pkEHQAAAACA+xkJOgAAAAAANkCCDgAAAACADZCgAwAAAABgAyToAAAAAADYAAk6AAAAAAA2QIIOAAAAAIANkKADAAAAAGAD93SCfi0x2dL9AAAAAADILJ6Z3QB3ZPPyUJG+S++43+FRjf6F1gAAAAAAYN49PYIOAAAAAMD9ggQdAAAAAAAbIEEHAAAAAMAGSNABAAAAALABEnQAAAAAAGyABB0AAAAAABsgQQcAAAAAwAZI0AEAAAAAsAESdAAAAAAAbIAEHQAAAAAAGyBBBwAAAADABkjQAQAAAACwARJ0AAAAAABsgAQdAAAAAAAbIEEHAAAAAMAG7ipBP3/+vGrVqqWoqCgdOXJEQUFBatOmjQYPHqyUlBRJ0rx589SsWTO1bNlSq1evliRdu3ZNXbt2VZs2bdSxY0dFR0dLkrZt26YWLVqodevWmjRpUgYdGgAAAAAA9447JuiJiYkaNGiQsmXLJkkaOXKkevToofDwcBmGoVWrVuns2bOaOXOm5syZo+nTpyssLEwJCQmKiIhQyZIlFR4erqZNm2ry5MmSpMGDBys0NFQRERHavn27du/enbFHCQAAAACAzd0xQR89erRat26t/PnzS5J2796tKlWqSJJq1qypDRs2aMeOHapQoYK8vb3l6+srPz8/7dmzR5GRkapRo4Zr340bNyo2NlYJCQny8/OTw+FQQECANm7cmIGHCAAAAACA/d02QV+4cKHy5s3rSrIlyTAMORwOSVKOHDkUExOj2NhY+fr6uvbJkSOHYmNj02y/fl8fH580+8bExFh6UAAAAAAA3Gs8b/fkt99+K4fDoY0bN+qvv/5ScHCwax25JMXFxSlnzpzy8fFRXFxcmu2+vr5ptt9u35w5c1p9XAAAAAAA3FNuO4I+e/ZszZo1SzNnzlTp0qU1evRo1axZU5s3b5YkrV27VpUrV1a5cuUUGRmp+Ph4xcTEKCoqSiVLllTFihW1Zs0a176VKlWSj4+PvLy8dPToURmGoXXr1qly5coZf6QAAAAAANjYbUfQbyU4OFgDBw5UWFiYihUrpnr16snDw0Pt2rVTmzZtZBiGevbsqaxZsyooKEjBwcEKCgqSl5eXQkNDJUkhISHq3bu3kpOTFRAQoPLly1t+YAAA
AAAA3EschmEYmd2Iu9GsWTMtXLjwpu1F+i6942sPj2qUEU0CAAAAAMAyd3UfdAAAAAAAkLFI0AEAAAAAsAESdAAAAAAAbIAEHQAAAAAAGyBBBwAAAADABkjQAQAAAACwARJ0AAAAAABsgAQdAAAAAAAbIEEHAAAAAMAGSNABAAAAALABEnQAAAAAAGyABB0AAAAAABsgQQcAAAAAwAZI0AEAAAAAsAESdAAAAAAAbIAEHQAAAAAAGyBBBwAAAADABkjQAQAAAACwARJ0AAAAAABsgAQdAAAAAAAbIEEHAAAAAMAGSNABAAAAALABEnQAAAAAAGyABB0AAAAAABsgQQcAAAAAwAZI0AEAAAAAsAESdAAAAAAAbIAEHQAAAAAAGyBBBwAAAADABkjQAQAAAACwARJ0AAAAAABsgAQdAAAAAAAbIEEHAAAAAMAGSNABAAAAALABEnQAAAAAAGzA8047JCcna8CAATp06JA8PDw0cuRIGYahvn37yuFwqESJEho8eLCyZMmiefPmac6cOfL09NR7772n2rVr69q1a/roo490/vx55ciRQ6NHj1bevHm1bds2DR8+XB4eHgoICFCXLl3+jeMFAAAAAMCW7jiCvnr1aknSnDlz1K1bN40cOVIjR45Ujx49FB4eLsMwtGrVKp09e1YzZ87UnDlzNH36dIWFhSkhIUEREREqWbKkwsPD1bRpU02ePFmSNHjwYIWGhioiIkLbt2/X7t27M/ZIAQAAAACwsTsm6HXr1tXQoUMlSX///bfy5cun3bt3q0qVKpKkmjVrasOGDdqxY4cqVKggb29v+fr6ys/PT3v27FFkZKRq1Kjh2nfjxo2KjY1VQkKC/Pz85HA4FBAQoI0bN2bgYQIAAAAAYG93tQbd09NTwcHBGjp0qOrVqyfDMORwOCRJOXLkUExMjGJjY+Xr6+t6TY4cORQbG5tm+/X7+vj4pNk3JibGyuMCAAAAAOCectdF4kaPHq1ly5Zp4MCBio+Pd22Pi4tTzpw55ePjo7i4uDTbfX1902y/3b45c+a04ngAAAAAALgn3TFBX7Rokb744gtJ0kMPPSSHw6GnnnpKmzdvliStXbtWlStXVrly5RQZGan4+HjFxMQoKipKJUuWVMWKFbVmzRrXvpUqVZKPj4+8vLx09OhRGYahdevWqXLlyhl4mAAAAAAA2Nsdq7i//PLL6tevn9q2baukpCR9/PHH8vf318CBAxUWFqZixYqpXr168vDwULt27dSmTRsZhqGePXsqa9asCgoKUnBwsIKCguTl5aXQ0FBJUkhIiHr37q3k5GQFBASofPnyGX6wAAAAAADYlcMwDCOzG3E3mjVrpoULF960vUjfpXd87eFRjTKiSQAAAAAAWOau16ADAAAAAICMQ4IOAAAAAIANkKADAAAAAGADJOgAAAAAANgACToAAAAAADZAgg4AAAAAgA2QoAMAAAAAYAMk6AAAAAAA2AAJOgAAAAAANkCCDgAAAACADZCgAwAAAABgAyToAAAAAADYAAk6AAAAAAA2QIIOAAAAAIANkKADAAAAAGADJOgAAAAAANgACToAAAAAADZAgg4AAAAAgA2QoAMAAAAAYAMk6P/ftcRkS/cDAAAAACA9PDO7AXaRzctDRfouveN+h0c1+hdaAwAAAAB40DCCDgAAAACADZCgAwAAAABgAyToAAAAAADYAAk6AAAAAAA2QIIOAAAAAIANkKADAAAAAGADJOgAAAAAANgACToAAAAAADZAgg4AAAAAgA2QoAMAAAAAYAMk6AAAAAAA2AAJOgAAAAAANkCCDgAAAACADZCgAwAAAABgA563ezIxMVEff/yxTpw4oYSEBL333nsqXry4+vbtK4fDoRIlSmjw4MHKkiWL5s2bpzlz5sjT01PvvfeeateurWvXrumjjz7S+fPnlSNHDo0ePVp58+bVtm3bNHz4cHl4eCggIEBdunT5t44XAAAAAABbuu0I+vfff6/cuXMr
PDxc06ZN09ChQzVy5Ej16NFD4eHhMgxDq1at0tmzZzVz5kzNmTNH06dPV1hYmBISEhQREaGSJUsqPDxcTZs21eTJkyVJgwcPVmhoqCIiIrR9+3bt3r37XzlYAAAAAADs6rYJev369dW9e3fXYw8PD+3evVtVqlSRJNWsWVMbNmzQjh07VKFCBXl7e8vX11d+fn7as2ePIiMjVaNGDde+GzduVGxsrBISEuTn5yeHw6GAgABt3LgxAw8RAAAAAAD7u22CniNHDvn4+Cg2NlbdunVTjx49ZBiGHA6H6/mYmBjFxsbK19c3zetiY2PTbL9+Xx8fnzT7xsTEZMSxAQAAAABwz7hjkbiTJ0/qjTfeUGBgoBo3bqwsWf73kri4OOXMmVM+Pj6Ki4tLs93X1zfN9tvtmzNnTiuPKdNdS0y2dD8AAAAAwP3vtkXizp07p7fffluDBg1StWrVJEllypTR5s2bVbVqVa1du1bPPfecypUrp3Hjxik+Pl4JCQmKiopSyZIlVbFiRa1Zs0blypXT2rVrValSJfn4+MjLy0tHjx5VoUKFtG7duvuuSFw2Lw8V6bv0jvsdHtXoX2gNAAAAAOBecNsEfcqUKbp8+bImT57sKvDWv39/DRs2TGFhYSpWrJjq1asnDw8PtWvXTm3atJFhGOrZs6eyZs2qoKAgBQcHKygoSF5eXgoNDZUkhYSEqHfv3kpOTlZAQIDKly+f8UcKAAAAAICNOQzDMDK7EXejWbNmWrhw4U3brRyptmssAAAAAMD9745r0AEAAAAAQMYjQQcAAAAAwAZI0AEAAAAAsAESdAAAAAAAbIAE3ea4pzoAAAAAPBhue5s1ZD7uqQ4AAAAADwZG0AEAAAAAsAESdAAAAAAAbIAE/QHCenYAAAAAsC/WoD9AWM8OAAAAAPbFCDoAAAAAADZAgg5TmC4PAAAAANZiijtMYbo8AAAAAFiLEXQAAAAAAGyABB0AAAAAABsgQQcAAAAAwAZI0AEAAAAAsAESdAAAAAAAbIAEHQAAAAAAGyBBBwAAAADABkjQAQAAAACwARJ0AAAAAABsgAQdAAAAAAAbIEEHAAAAAMAGSNABAAAAALABEnQAAAAAAGyABB0AAAAAABsgQQcAAAAAwAZI0AEAAAAAsAESdAAAAAAAbIAEHQAAAAAAGyBBBwAAAADABkjQAQAAAACwARJ0AAAAAABsgAQdAAAAAAAbIEEHAAAAAMAG7ipB3759u9q1aydJOnLkiIKCgtSmTRsNHjxYKSkpkqR58+apWbNmatmypVavXi1Junbtmrp27ao2bdqoY8eOio6OliRt27ZNLVq0UOvWrTVp0qSMOC4AAAAAAO4pd0zQp02bpgEDBig+Pl6SNHLkSPXo0UPh4eEyDEOrVq3S2bNnNXPmTM2ZM0fTp09XWFiYEhISFBERoZIlSyo8PFxNmzbV5MmTJUmDBw9WaGioIiIitH37du3evTtjjxIAAAAAAJu7Y4Lu5+eniRMnuh7v3r1bVapUkSTVrFlTGzZs0I4dO1ShQgV5e3vL19dXfn5+2rNnjyIjI1WjRg3Xvhs3blRsbKwSEhLk5+cnh8OhgIAAbdy4MYMODwAAAACAe8MdE/R69erJ09PT9dgwDDkcDklSjhw5FBMTo9jYWPn6+rr2yZEjh2JjY9Nsv35fHx+fNPvGxMRYdkAAAAAAANyL0l0kLkuW/70kLi5OOXPmlI+Pj+Li4tJs9/X1TbP9dvvmzJnTnWMAAAAAAOCel+4EvUyZMtq8ebMkae3atapcubLKlSunyMhIxcfHKyYmRlFRUSpZsqQqVqyoNWvWuPatVKmSfHx85OXlpaNHj8owDK1bt06VK1e29qgAAAAAALjHeN55l7SCg4M1cOBAhYWFqVixYqpXr548PDzUrl07tWnTRoZhqGfPnsqaNauCgoIUHBysoKAgeXl5KTQ0VJIUEhKi3r17Kzk5WQEBASpfvrzlBwYAAAAAwL3krhL0ggUL
at68eZKkokWLatasWTft07JlS7Vs2TLNtoceekgTJky4ad9nnnnGFQ8AAAAAAJiY4g4AAAAAAKxHgo5Mdy0x2bL9rIwFAAAAAP+mdK9BB6yWzctDRfouveN+h0c1+ldjAQAAAMC/iRF0AAAAAABsgAQdAAAAAAAbIEEH/gHr2QEAAAD8m1iDDvwD1rMDAAAA+Dcxgg78C6hUDwAAAOBOGEEH/gVUqgcAAABwJ4ygAwAAAABgAyToAAAAAADYAAk6AAAAAAA2QIIOPMAoOAcAAADYB0XigAeYlQXnriUmK5uXh2X7AQAAAA8aEnQAlqC6PAAAAOAeprgDAAAAAGADJOgAbIe18QAAAHgQMcUdgO3YdW28XWNZ/TcBAACQOUjQAdzXrEz27RrrbuPdbSySfQAAgMxBgg4ASMOuyf6DEAsAADzYSNABABnGymT/QYhFsg8AwIONBB0AAJuwMtkHAAD3Hqq4AwAAAABgAyToAADch+7mNoTcqhAAAHthijsAAPchu66NZ509AAD/jAQdAADcFkX1AAD4d5CgAwCAe5Jdk306DgAAZpGgAwCAB96DMEuAWJkXCwDuFgk6AACATdm144BY6Yt1r3cw/NuxgAcZCToAAACQge4m2ZfuLuF/EGLZtePArrFwfyFBBwAAAGAbdu04sGssOg7uLyToAAAAAHCPouPg/ojlRIIOAAAAALCUXTsO7BrLKctd7wkAAAAAADIMCToAAAAAADaQaVPcU1JS9J///Ed79+6Vt7e3hg0bpsKFC2dWcwAAAAAAyFSZNoK+cuVKJSQkaO7cuerVq5dGjRqVWU0BAAAAACDTZVqCHhkZqRo1akiSnnnmGe3atSuzmgIAAAAAQKZzGIZhZMYf7t+/v15++WXVqlVLkvTCCy9o5cqV8vS89az7qlWr6oknnvg3mwgAAAAAgOXy5Mmj6dOn37Q909ag+/j4KC4uzvU4JSXlH5NzSdq8efO/0SwAAAAAADJFpk1xr1ixotauXStJ2rZtm0qWLJlZTQEAAAAAINNl2hR3ZxX3ffv2yTAMjRgxQv7+/pnRFAAAAAAAMl2mJegAAAAAAOB/Mm2KOwAAAAAA+B8SdAAAAAAAbIAEHQAAAAAAGyBBhxITEzO7CX8QLDoAACAASURBVACABxBlcAAASIsE/R9ER0crJSXF1Gs//fRTSdLKlSutbJJlIiIiVK9ePdWpU0cvvviiGjVqlNlNAmCR2NhY7d27V1euXHE7VkpKipKTk7V161YlJCRY0Dr37dy5M83j3377ze2Yly5dcjuGJO3atcuSOA+S9u3bWxJn5MiRlsQBMtL3339vSZxWrVqpdevWaf5zbjPLyt8OO7Pyd+3ixYvasWOHoqOjLWqdvSQlJaV5fPnyZbfiHT58WGvWrNGpU6fuy85ZK4/J07JImWzjxo06duyYypUrp6JFiypr1qym4mzatEn9+/eXj4+PYmJiNHToUFWvXj1dMVatWqX8+fNr5syZOn/+fJrnWrVqla5Yf//99z8+9/jjj6crltP8+fM1c+ZMff7556pfv75mzJhhKo7T+fPnFR8f73a7tmzZoqtXr8owDA0dOlTdu3dX48aNTcXq1auXQkNDTb32RpMnT9b777/vehwaGqpevXqZihUbG6tp06bp7NmzeuGFF1SqVCkVLlzYVKyTJ09qyZIlad77Ll26mIp15coVXb58WZ6enpo7d66aNm2qJ554wlSs5ORkLVy4UCdPnlTVqlVVokQJ5c2b13SsP//8U9euXXNte/bZZ03F2r9/v2JjY5UlSxaFhYWpc+fOqlatWrpivPjii3I4HK7Hnp6eSkpKkre3t3788cd0xUpOTlZycrI+/PBDffrppzIMQykpKXr33Xf1zTffpCuW008//aQpU6YoOTlZ9evXl8PhSPPZTY+xY8eqUKFC+vvvv7V7927ly5dPo0ePTleMdevW/eNzAQEB6Yq1detWHThwQF9//bXeeustSanvYXh4
uJYsWZKuWE6//fabhgwZ4nq/Hn/8cbVo0cJULEmaPn26Tpw4oSZNmqhJkybKmTOn6VhWfY927NihpUuXpjlP/Oc//zHVnhs/r4ZhqGPHjqY/r5J1FzRRUVG6fPmyW+/59WJjY+VwOLRixQrVrl1buXLlSncMK8/RkjRkyBANGjTI9bhPnz4aM2bMXb++X79+//ic2Q6OefPmacaMGbp27ZoMw5DD4dCqVavSHcd5PkhMTNTVq1f12GOP6dSpU3r44Yf1888/m2pbbGys1q5dmyYJa9q0abrjDB06VK+99ppKly5tqh3Xmzdvnpo0aeJ2nLCwMLdjXM/K3w7Jmt/b61l1jW/F75rTDz/8oPHjx8vf31/79+9Xly5dFBgYmO44e/bsUf/+/XXq1Ck98sgjGj58uMqWLWuqTVLqNfnnn3+uw4cPq0SJEurcuXO6z19nz55VbGysgoODNWbMGNf1SXBwsBYsWGCqXbNmzdKKFSt06dIlNW3aVEePHk1zPksvqz4TixYt0hdffKGEhAS3zmGS1KFDB3311VemXnuj+yJBDwsL06lTpxQVFSUvLy9NnTrV9Mlr/PjxCg8PV4ECBXT69Gl16dIl3Qn6iBEjtH79eiUkJGj//v06evSoChYsaOriqmfPnpJSe+ni4uJUokQJHThwQPny5dN3332X7niSlCdPHuXPn19xcXGqWrWqJkyYYCqOlHqht3btWuXPn9/1wZ4zZ46pWGPHjtUnn3yikJAQRUREqEePHqYT9ISEBO3Zs0dFixZ1JVTe3t7pijF//nwtWLBAUVFRWrt2raTUi9SkpCTTCfrHH3+smjVrasuWLcqXL5/69++vWbNmmYrVvXt3VatWTY899pip11+vd+/eatasmZYvX67ixYtr0KBBmj59uqlYgwYNUv78+bVhwwY99dRTCg4O1rRp00zF6tatmy5fvqxHHnlEkuRwOEwn6IMHD1b//v01ceJE9ezZU2PHjk33BcNPP/0kwzAUEhKi1q1bq1y5cvrzzz8VHh6e7vZ8++23mjJlis6fP6/69etLkrJkyaLKlSunO5bT119/rXnz5qlDhw56//331bx5c9MXWZGRkfroo4/Url07zZw509RI59KlS//xufQm6Dlz5tS5c+eUkJCgs2fPSkp9vz788MN0t8tp/PjxmjVrlrp27arOnTsrKCjIrQT9008/1aVLl7RkyRJ1795defPmVcuWLVW1atV0x7LqexQcHKyOHTu6nbg6P6/nzp1TvXr1JLn/eZWkSpUqufV6p4MHD+q5555Tnjx55HA45HA49Ouvv5qK1adPH1WvXl1//PGHUlJStGLFCn322WfpjmPVOXr27Nn6/PPPdfHiRS1fvlxSasdG8eLF0xWnYcOGklJn0VWoUEEVK1bUzp07b5qVkh5z5szR1KlTXedos5ydeb1791avXr302GOP6fTp027NjHj//feVP39+1/t/fedqetSqVUtTpkzR6dOnXZ1vPj4+pmIlJCSoadOmKlq0qLJkSZ3EamYwwWwH+j+x8rdDsub31snKa3wrftecZsyYoYULFypHjhyKjY1V+/btTSXow4cP1/Dhw/Xkk0/qr7/+UkhIiOnraEnq0aOHGjRooNdee02RkZHq06ePvvjii3TF2L59u2bMmKFDhw5p4MCBklLP9+n93b7e0qVLFR4erjfeeENvvvmmmjdvbjqWlZ+JadOmacqUKZZcS/v6+mrlypVpvt9FixY1Feu+SNAjIyM1e/ZstWvXTq+++qoiIiJMx/Lw8FCBAgUkSQUKFDDVI1OuXDmVK1dOnp6eWrBggfz9/bV27VpTPedz586VJH3wwQcaPXq0fHx8dOXKFbcuSp0fIGcy7c7UnB07dmjlypWuD6I7smbNqocfflienp565JFH3Jp6dPjw4TQ/LmZ6xAIDA1WtWjV98cUX6ty5s6TUE9TDDz9sul0XL17Ua6+9pu+//14VK1Z0a/QoR44crg4cd12+fFl1
6tTRzJkzNWbMGNMXtpJ09OhRDR8+XJGRkXrxxRc1depU07EuXLhgKvm9FU9PT5UoUUKJiYl65plnlJycnO4Yzk4eZ6+tJJUpU0aHDh1KdyyHw6Gff/5Z1apVc104O7eblSVLFnl7e7sSlIceesh0rJSUFO3YsUMFCxZUQkKCqfNESEiI6b9/o5IlS6pkyZLy9PTUd999p6SkJBmGIU9PT7344oumYmbJkkW5c+eWw+FQ1qxZlSNHDrfbee7cOf3999+6cOGC/P399dNPP2nRokXpTjSs+h4VLlxYzZo1M/Xa67Vs2VItW7bUggUL9Nprr7kdz6lHjx5uvd45KmymE+SfnDhxQoGBgVqwYIFbF/FWnaPbtm2rtm3basqUKa7fIjNq1KghSfq///s/dezYUVJqB4lzRooZefLksTRZPH78uOtCuUCBAjp58qTpWIZh6JNPPnG7TTVr1lTNmjUVHR2t4cOHa8yYMapfv766du2a7mPv3bu32+3JCFb+dkjW/N46WXmNb8XvmpPD4XD9Zvj4+JgewTUMQ08++aQkqXTp0vL0dD81a9OmjSTpySef1E8//ZTu19etW1d169bVmjVrVKtWLbfbI/1vtpTZQbPrWfmZKFSokOmZrDeKjo7W+PHjdezYMdfArNkZZvdFgp6cnKz4+Hg5HA4lJye7lSz6+Pho5syZevbZZ7Vlyxblzp3bdKyVK1fe1LtmZnqVJJ06dcrVY5s9e3adOXPGdLuGDRumo0ePqlevXvrqq69MTXd08vPzU3x8vNsncyn1vX/rrbfUpk0bzZ492/RUeUn673//K8MwFB0drdy5c8vDwyPdMby9vVWwYEENGjRIu3btciUEkZGReuWVV0y3LSoqSlLqv6k7n9USJUpo6dKlKl26tOuEZ7anLjExUV999ZXKlCmjAwcOKC4uznS7kpOTXT96ziluZj3++OM6efKkJT2bDodDvXr1Us2aNfXDDz+49Zn19fXVuHHjVK5cOf3xxx+mLlAfffRRSamjdVapXLmyPvzwQ50+fVqDBg3S008/bTpW06ZNNXToUI0cOVKffPKJ2rVrl+4YzqmS13N3Ctny5cstW6Lj5+en0NBQXbx4UVOnTnXrnCNJLVq0UNasWdWyZUt1797ddQHSoUOHdMey6ntUr1499ezZU/7+/q5t7kyzrl69urp166aoqCgVKVJE/fr1U8GCBU3Hc9euXbt07do1NWnSRBUqVJDk/rT5xMRE/fDDDypevLiio6N18eJFU3GsPEdLqQnrokWL0mwzc01x5coVbdy4UU8//bT++OMPU4VinaNVCQkJ6tChg8qUKeM6RncGEPz9/fXRRx+pXLly2rZtm1szLEqVKqXt27enmZpuJimIiorSwoULtXr1alWpUkXh4eFKSkpS165dtXDhwruKsXr1atWuXfuWnblVqlRJd5usVrlyZfXq1cuS3w7J2t9bK6/xAwMDNXToUI0YMUJjx47VG2+8YTqWn5+fRo0apcqVK2vr1q3y8/MzFcfT01OrV69W5cqVtWXLFrcSV0kqVqyYFi9erOeee067d+9W7ty5XZ+79J5/fvrpp5sSfLOzWho2bKjXX39dJ06cUMeOHVW3bl1TcSRrPxPZsmXTO++8k+Y8bfYcFhQUpPHjx+v555/Xvn373Oocdxj3wSr9H3/8UZMmTVJ0dLQee+wxvfnmm6bX+Hz22WeKi4vTwYMHVaxYMcXGxmrIkCGmYrVu3TrNNJU2bdqYHg389NNPFRkZqaeeeko7duxQvXr1LCuu447WrVvr8OHDKly4sOuDnd6pOfPnz1eLFi0UGhqqo0ePqnDhwjp8+LCKFCli+kuyefNmffzxx/L19dXly5dN1RJw6ty5sxITE3XmzBklJycrf/78+vrrr03F2rdvnwYOHKioqCgVK1ZMgwcPNr3WqF27drp48aIlPXWRkZFatWqVOnfurP/+9796+umnXSPE6bVlyxYNHDhQp06dUsGCBfXx
xx/r+eefT1cM5zSqhIQEXblyJU1H2e3WNd9OdHS0du7cqZo1a2rz5s168sknTXfAXblyRd999532798vf39/tWnTxlQnUEZYu3at9u3bp2LFipkeWZZSp8F+/fXXaUaqndNrM1OHDh00ffp01/rbtm3bavbs2aZiJSQk6Ntvv3W9X61atXLr4uibb77RjBkz5Ovrq6tXr2rIkCGmR3ad36OzZ8/qscceM/U9klI7DV566aU0U9zdKST1zjvvKCgoSM8++6x+++03zZw50+06Ju7at2+fvv/+e+3YsUPPPvusmjRp4taIyPLly7V06VL169dPc+fOVbly5VS7du10x2nXrl2aDqqEhAS3pq46p0IbhqG//vpLuXPnNjU9OioqSuPHj9eBAwfk7++vQYMGpXuK+u2W2L366qvpbpNTSkqK1q5dq/3796tYsWKqU6eO6VhNmjRRbGys67HZjkFnAbYGDRooW7Zsru2zZ89W27Zt7yrGd999p1dffVWTJk266Tl3OsysEhMToz/++MOS3w7p5t/b0qVLm6rjIKUmiRMnTnRd47/11lumlz9aKSkpSXPnzlVUVJT8/f3VsmVLeXl5pTvOiRMnNHr0aB08eFD+/v7q06ePW7NSbjzvSP/rGE/vNaJzNqVhGPrzzz915swZ0+vGmzVrJj8/P7388svy9/dXqVKlTMWRrM37bnUuM3sOa9Wqlb766qs0A7PffvutqVj3RYIupVbhPXLkiAoVKqQ8efKk+/XXrzd2jjSkpKQoKSnJ9FrvPn36KG/evK7etYsXL2rUqFGmYkmpRTecP1rO6TCZxZlU35hAOxyOdF8w/Prrr6pRo4alX5KgoCCNGzcuTS2B+fPnm4r1+uuva9asWerfv78GDhyot956y63pNFa5vkDJvn371KVLF9MzNCTpzJkzrmTszJkzrtGo9FqzZo2GDBmibNmyKT4+XsOGDdNzzz1nKtaNo+fXfz/T6+LFi1q3bl2aY+zUqVO6YuzcuVNPP/30LTsJ3FmbZZVmzZqpefPmCgwMNL1G8vpYU6ZMSTNSPXny5HTFcBa1atas2U2Jr9lEpUePHnrllVe0YsUKVahQQTNmzEh3gT6nt99+27KCLlJqMvzFF18ob968Onv2rD744APNmzfPVKx58+bp66+/1t9//608efIoS5YsppKLd955R19++aWpNtyKc+2mk/P8aBdbtmzRzJkzderUqXS/90lJSfL09Lzl8iozHTfOTi7nCLWVnVyGYahTp07pWvpwu2VjZjumtm3bph07duiNN95Qr1699Pbbb7tV4MqK8/StYubKlcut5UMbNmzQ8ePH3S5IZXUsqwQFBVlyXXN9MUnnLAt3i59u2rRJpUuX1pEjR0zXc+rWrZsmTJhwy99ps53+zuK6Hh4emjdvnlvFdaOjo3X16lXXZ9Sd2Vw3dq74+vpq8eLFpuNdz93fzKioKP38889atWqV8uXLd8sOq7vlzPvMfiackpKStHPnzjTnHLOzZK0cmL0vprj/8ssvioiI0NWrV13b0nsiyIj1xiNGjNDcuXO1YcMG+fv7my4sJqUmKr/88ovi4+N18OBBrVy5MlN7XZ3Tc51r2tzhjOFOr/uNrKgl4ORcD3T16lVly5bNrfvGT5o0SbNnz04z2mr2x+FWBUrMJugff/yxtm3bpqtXr+rq1avy8/MznVhMmjRJ8+fPdytJ2bdvn86cOaOxY8eqT58+rgqioaGhpn9ounXrpiJFimjfvn3KmjWrqSl3zmmhtyp+ZocEferUqVq8eLHat2+vEiVKqEWLFqaniFpRTNJZB+L48eMKCAhQ2bJlVbNmTWXPnt1UmyRrl+hYWdBFSl1z7LxQeOSRR9ya1jlnzhxNmzbN7eJbefLk0aBBg9JMQU7v3USul5ycrL1796pUqVLau3evW22zUmxsrFasWKElS5bo6tWrpkZTgoODFRoaetPSDIfDYeq2qTfeMcWdavdS2gT7zJkzOn78
eLpe7zwu52ia5P6Sk2HDhrkGHnr06KG+ffuantEiWXOedtqyZYtCQkLcvkuDlQWprIxlpVy5cmnGjBlpzoVmftOuL37aoEEDSe4Xk5w4caJmz55telafJNfvl9nrrVvp3bu3mjdvrmXLlrlVXHfgwIHatGmTHn74YbeLLUtyTUk3DEO7du3SsmXLTMe6/v06e/aszp07ZzrWnj17tH79em3atElS6lT89Prwww//saPN7J2bunTpctMsWbMJulXLHqT7JEEfP368+vXrp3z58pmO4VxvPHToUMva5enpedfTn+7EyordVsiIpNoKMTEx8vX1vamWgNmpVZL00ksvadKkSXryySfVsmVLtwpJrV69WqtXr04zTc4sqwqUSKkVkJcuXapBgwapZ8+e6t69u+lYViQply9f1tKlS3X+/HnXLbQcDoer8IlZQ4YMUb9+/TR8+HBT3813331X0s1rsNypCWGlfPnyqUOHDmrQoIHGjh2r9957z/R9wq0oJuk8J//222+KiorSqlWrNHDgQD388MOmqmJLqZ/1MmXKSJL69u1rKoZTdHR0mqTJ7BRk5wV2cnKyOnXqpEqVKmnHjh1uTZe3qviWc6q3OxdW1xs4cKD69++vM2fOKH/+/Bo2bJglcc368ccftXTpUv399996+eWXFRISYnpNvPMCr3PnzpoxY4ar09/sREMr75giyXW3hwsXLujRRx91FXq7W2ZvV3Y7np6ermryhQoVsqRgrLvnaadx48ZZcpcGKwtSWRnLSnny5NGePXu0Z88e1zYzCXpGFD91OBz64IMP0nQemF3+aMWAntPly5f14osvasaMGW4V1927d6+WL1/u1nt0vet/dypVquRWB9D1gxHe3t4aMWKE6Vht27ZVoUKF1LNnT9OF525cnuXscHRHbGzsTbNkzbJyYPa+SNBz5cpliyIbGcnKit33s86dO2v27NnKnz+/Tp48qXHjxqlYsWJu3arl+guEWrVqqUiRIqZjOavUW8HKnrrs2bPL4XDoypUryps3r1tFg6xIUipXrqzKlStr9+7dbk2XvFF8fLyuXLniOlazJkyYoPDwcCUmJuratWsqUqTIbW8p9m9ZtGiRvvvuO6WkpKh58+Zufe6tHKl29pxv3rxZkkwvU7Baw4YNb5qCbIZz1P360Xeza2etLr7VpUsX/fLLL9q/f7+KFi3qVmEeKfXfMi4uTp6enoqOjtYHH3xgevTVCj179nQt+9q3b58+/fRT13NmR1SsunWYlXdMkVJvXTVkyBAVLlxYV65cMT0N9lZrVM0mKY8//rjCwsL0zDPPaMeOHcqfP7+pONez6jxt1V0arCxIZWUsK40cOVKHDh3S0aNHVapUKdP/jhlR/NSd23HdyIoBPSdncd2yZcu6VVz3kUceUVxcnNvL0pxCQ0Nd3++zZ8+69RkbOXKk9u3bpwMHDqho0aJpCi6m1+bNmxUZGal169bpq6++0sMPP5zuzgNnrnere72bZeUsWSsHZu/pNejOW5CtWrVKBQoUUNmyZS2ZwmdHI0aMUPny5S2rBnu/6tChgy5evKgjR46kSQLcmTJkRY+rsxft0KFDSkxMVIkSJVz/jmYvIq0qUCKlJgW5cuXSuXPndOrUKR07dkwLFixIVwwriwZlxNrlZcuW6ciRI8qTJ48mTZqkihUrprmYT4/XXntN4eHhGjFihN566y2FhIRYupbZrFGjRqlFixa2SYCdKlWq5HbPeUa4cZ39N998Y3pk3ypWF98KDQ3VkSNHVLFiRW3dulWFChVScHCw6fY1atRIkydPTjOby92qw+643QwRsx33zkKE7oqNjdXRo0eVL18+ffXVV6pdu7Zbt4OzqsbBwYMHJaXODNi9e7f27NljOqGKj49XRESEDh06pOLFi7tdaNHK83T//v2VN29e/fLLL2rcuLGioqI0evTodMexsiCVlbGsNGvWLK1YsUKXLl3Sq6++qiNHjpguBmY157XOgQMHVKRI
EQUFBZn+jL355pumi/ze6Pfff9fKlStNF9dt1aqVHA6Hzp07pytXrqhQoUKS3LteldL+hmTNmlU1atSQr6+vqVgzZ87UkiVLXHesadCggam7kkipM9bWrVunNWvW6Pjx43r22WdN33awXbt2atiwoSpUqKDIyEitXbs23fd6d5o9e7YuXLggb29vrVy5Ug899FCmFz6V7vER9LNnz0pKXSPx/vvvu6bwxcfHZ2azMsRff/2VZuqRu9Vg71fTpk1zVZkcPHiwJTGt6HHdsGGDxo8fb0l7nKzsqVu3bp0KFiyoRo0aydvb29RaLyuXO2TE2uVLly5p8eLFrnX227dvNx0rd+7c8vb2VlxcnAoXLpym8yYzderUSevXr9fOnTstK7BkBSt6zjOC1VOQrWD1sqEtW7a4fivat2+vli1buhXPynvGWsHK2XNWz16wcjmGZF2Ng+vXfvr7+5uqMuwsmLllyxYVL17cNc39t99+c6seh5Xn6ZCQEM2fP1+VK1dW9uzZTS9hbNCggZ5//nm3ChFnRCwrLV26VOHh4XrjjTfUvn17S0et3TVo0CDlzJlT1atX12+//aYBAwZozJgx6YrhHNDz9vbWwIED3RrQc96yLE+ePGrRooXOnz9v6g4bYWFhMgxDzZs3V/Xq1VW2bFnVqlXL7dsWW/kbsmTJEs2ePVuenp5KTExU69atTSfo77zzjurWravOnTurRIkSbrctKChIkvl7vTtZOUvWSvd0gl6gQAEtWLBADz30kGvth7Pyujvz/u3IqqmY97ssWbLo0UcfTVdl2zuxYglF8eLFbb0MY+HCha41wj///LPb1TXdlRFrl62atiqlTuNznntCQ0PT3MonM3Xv3t2yAktWunz5sk6fPq2///5b165dc/t+41axegqyHSUlJSklJUVZsmRJUxzMLCvvGWs3t1qqYAdW1zhwJitSav0MM1NzM6pgppXnaYfDIU9PT+XJk0clSpRQbGysqWrPVq5btjKWlZyTaZ3f6cycFXOjI0eOuAoP1q1b19RtIq0c0HPOLLjxXJregTNnjZHNmze7rnMGDBjg1nWO1Zy3WJUkLy8v07M0pdTrTKsUK1ZM33//vapWrer2vd7t+p28p7O8wMBAPf/885oyZYplldftyupqsLgzK3tcjx8//o8jhna4uL1xjbCZ6poZwcq1y1YV3ZKkrl27KiYmRk8//bQCAwNtU+hHsq7AkpWs7jm3ipXr7O2qYcOGCgoKUvny5bVjx440hZvMsNMSBavZreipk5U1DqT/JStS6hTYcePGpTvG9QUzo6Ojde3aNdPtuZ6V5+lBgwYpf/782rBhg5566ikFBwdr2rRp6Y5j5bplK2NZqWHDhnr99dd14sQJdezY0e1aFVaKj4/X1atX9dBDD+nq1atKTk5OdwwrB/Sct5m81W0UzbBrjRYpdXlat27dVKlSJUVGRpq+/a7VDh48qIMHD2rBggWuzqVBgwaZute7Xb+T93SC7u3trSeeeMLSyut2ZcepmPc7K3tcs2XLZrtRmetZUV0zI1jRLqunrUqp01U7deqk8PBwhYaGKiwsLM29oTOTVQWWrGRlz7mVrJ6CbEcvvPCCAgICdPDgQb322msqWbKkW/HsmsTez6x+z28sHGi26r2UelG8ceNGt28RlRHn6aNHj2r48OHaunWrXnzxRdMz66wsRGzXosaLFi2Sn5+f2rZtK39/f5UqVSqzm+TivIVs8eLFdeDAAXXr1i3dMTJiQM+qgTO7Xn9Jqbee/OWXX3Tw4EE1b97cNu2bOXOmLly4oGPHjrl9H3S7fifv6QT9QfIgTMW0Gyt7XPPly2frC1u7rhG2ol0ZMW01KSlJzz77rKZMmaJGjRopPDzcstjuaNu2rWbMmKGAgAC98MILqlixYmY3CZmsf//+ioiIcDsxx/3j+sKBixYt0tatW013UO3Zs8eSW0RlxHk6OTlZ0dHRcjgcio2NTXc1aytn0VkZKyM4l7n9
/PPP+uabbzJ9mdv1smfPrqJFiyouLk6PP/64Fi1apEaNGqUrRkYM6Fk1cGbX6y8p9ZaOmzZt0qFDS4Xi3wAABzFJREFUh3ThwgVVrFjRdME5K/34448aN26c/P39tX//fnXp0kWBgYHpimH37yQJ+j3iQZiKaTdW9rg+9dRTVjfPUnZdI2xFuzKiYyQxMVEjR45U5cqVtWnTJlNT7jKClQWWcG+LiYmRr6+vsmfPrhEjRqS5h7AdLj6QeawsHGjVLaIy4jzds2dPtWnTRqdOnVLr1q318ccfp+v1Vs6is3tRY+c0602bNkmyzzI3SRozZoyGDh2qnDlzZnZT0rBq4Myu119S6gh67dq11bRpU23dulXBwcGaPHlyZjdLX3/9tRYuXKgcOXIoNjZW7du3T3eCbvfvJAn6PeJBmIppN1b2uLpzW6N/g13XCNu1XaNGjdL69evVokULrVy5UmPHjs3sJkmytsAS7m2dO3fW7Nmz9f/au7uQJts4juO/bWIHDrMmZdgbVhDRYUs0OtBKgghiUNBBKPTCDoQ6iAorAyGKDiohJBB7QciiwIOOhJ0IvRw0KIwIoiJHIbpckqvmy+ZzIJnC0wNrV899ee/7OZpMbv732GD/Xdf1/5WXl6u4uFjDw8NOlwRLmBgcODsiqq6uzlhElEnfv3/XxMSEysvLlUqlsl5BN7mLzvahxjZvs163bp2VW5BNLZzZ+j1Hmm5WZ09L7+npcbiiaR6PR0VFRZKm+6MFCxZkfQ3bP5PzOgcdAPCLqfxmzH8HDx7UyMiI+vv75wwdsqmBgjPa29sViURmBgfu3LlTDQ0NWV3j06dPv42IMjXkLVe55saPj48rHo//6y66bKecm7zW3zA5OTmzzbqvr8+qbdbd3d26e/funFX9CxcuOFiR+/2ciN7a2qq6ujoFg0H19fUpEolY8dqfOHFCixcv1qZNmxSNRjUyMqKLFy9mdQ3bP5M06AAwz/38IvX8+XMVFhYaGbCE+S2TyWhoaEjNzc06d+7cnOdsaaDgjFAopOXLl2vXrl2qqKjIedVudkSnTRFRDQ0NunXr1szf9fX1un37tnMFWSyRSOjRo0fq7e3Vx48fFQwGdfz4cafLkjT9fj106NCcs89bt251sCL3O3DgwMzj2Ttsso2S+1smJyd17949vX//XhUVFdq3b19OEXA2Yos7AMxztuY3wzler1dlZWV/PLka7jV7INjNmzdzGghmY0SU6dz4fGDzNuvS0tKc4yGRHdNRcqZ5PB4VFhZq4cKFWrt2rUZHR3Oa5G4jO15pAMAfszkhAIBdTA4Es/Hssunc+HxgaxSmNB1Tayp6D9kxFSVnWnNzs5YsWaInT55o48aNOnnypNrb250uyygadAAAgDxhsqm2MSKKHyzdpaamxukS8papKDnTYrGYzp8/r2g0qtraWlfuFKNBBwAAyBMmm2qbI6LgDvzg4hxTUXKmpdNpJRIJeTweJZPJrBMa5gOGxAEAAOQJkwPBQqGQtm/frh07dlh3dhlAbpLJpGKxmEpLS3Xjxg3V1NSosrLS6bL07NkzNTU1aWBgQKtXr1ZTU5Oqq6udLssoVtABAADyhMmBYDafXQaQG7/frw0bNkiSTp065XA1v3z58kXpdFqrVq1SKpVSJpNxuiTjWEEHAAAAAFhvz5496ujoUCAQ0OfPnxUOh/XgwQOnyzLKfZv2AQAAAACuU1JSokAgIGk6hs/v9ztckXmsoAMAAAAArNfY2KgfP34oGAzq1atXisfj2rx5syT3RPBxBh0AAAAAYL1t27bNPF66dKmDlfw9rKADAAAAAGABzqADAAAAAGABGnQAAAAAACxAgw4AgIuNjY3p/v37v32+trZWY2Nj/2NFAADgd2jQAQBwsXg8/p8NOgAAsAdT3AEAcLHr16/r7du3unbtml6+fKlkMql0Oq2jR4+qqqpq5v+6urr0+PFjXb58WS9evNCVK1fk8/m0YsUKtbS06OHDh+rt
7VUqlVIsFtPhw4cVCoUcvDMAANyHBh0AABcLh8N68+aNvn37purqatXX12twcFD79+9XJBKRJHV2dur169dqbW2V1+vV2bNndefOHQUCAV29elXd3d0qKChQMplUR0eHPnz4oHA4TIMOAIBhNOgAAOSBd+/eaffu3ZKms2P9fr8SiYQk6enTp/L5fPL5fBoeHtbQ0JCOHTsmSUqlUtqyZYtWrlyp9evXS5KWLVum8fFxZ24EAAAX4ww6AAAu5vV6lclktGbNGkWjUUnS4OCgvn79qpKSEklSW1ubiouL1dXVpUWLFqmsrExtbW3q7OxUOBxWZWWlJMnj8Th2HwAA5ANW0AEAcLFAIKCJiQmNjo6qv79fPT09SqVSamlpUUHBr68BZ86c0d69e1VVVaXTp0/ryJEjmpqaUlFRkS5duqSBgQEH7wIAgPzgmZqamnK6CAAAAAAA8h1b3AEAAAAAsAANOgAAAAAAFqBBBwAAAADAAjToAAAAAABYgAYdAAAAAAAL0KADAAAAAGABGnQAAAAAACzwDzuMpufzct9GAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 1008x288 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "n = 50\n",
    "(tokens\n",
    " .iloc[:50]\n",
    " .plot\n",
    " .bar(figsize=(14, 4), title=f'Most frequent {n} of {len(tokens):,d} tokens'))\n",
    "sns.despine()\n",
    "plt.tight_layout();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Document-Term Matrix with `CountVectorizer`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The scikit-learn preprocessing module offers two tools to create a document-term matrix. The [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) uses binary or absolute counts to measure the term frequency tf(d, t) for each document d and token t.\n",
    "\n",
    "The [TfIDFVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), in contrast, weighs the (absolute) term frequency by the inverse document frequency (idf). As a result, a term that appears in more documents will receive a lower weight than a token with the same frequency for a given document but lower frequency across all documents. \n",
    "\n",
    "The resulting tf-idf vectors for each document are normalized with respect to their absolute or squared totals (see the sklearn documentation for details). The tf-idf measure was originally used in information retrieval to rank search engine results and has subsequently proven useful for text classification or clustering."
   ]
  },
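  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the difference between raw counts and tf-idf weights concrete, here is a minimal sketch on a made-up three-sentence corpus (`toy_docs` and the other variable names are assumptions for illustration, not part of the BBC data). It relies on the `TfidfVectorizer` defaults (`norm='l2'`, `smooth_idf=True`):\n",
    "\n",
    "```python\n",
    "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n",
    "\n",
    "# hypothetical toy corpus, not taken from the BBC articles\n",
    "toy_docs = ['the cat sat', 'the dog sat', 'the cat ran']\n",
    "\n",
    "# raw term counts: one row per document, one column per token\n",
    "counts = CountVectorizer().fit_transform(toy_docs).toarray()\n",
    "\n",
    "# tf-idf weights with the default L2 normalization\n",
    "tfidf_vectorizer = TfidfVectorizer()\n",
    "tfidf = tfidf_vectorizer.fit_transform(toy_docs).toarray()\n",
    "vocab = tfidf_vectorizer.vocabulary_\n",
    "\n",
    "# 'the' occurs in every document, 'ran' in only one, so 'ran' has the higher idf\n",
    "print(tfidf_vectorizer.idf_[vocab['the']], tfidf_vectorizer.idf_[vocab['ran']])\n",
    "\n",
    "# squared entries of each row sum to one under the L2 norm\n",
    "print((tfidf ** 2).sum(axis=1))\n",
    "```\n",
    "\n",
    "The token `the` appears in every toy document and therefore receives the lowest idf, whereas `ran` appears in a single document and receives a higher one."
   ]
  },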
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both tools use the same interface and perform tokenization and further optional preprocessing of a list of documents before vectorizing the text by generating token counts to populate the document-term matrix.\n",
    "\n",
    "Key parameters that affect the size of the vocabulary include:\n",
    "\n",
    "- `stop_words`: use a built-in or provide a list of (frequent) words to exclude\n",
    "- `ngram_range`: include n-grams in a range for n defined by a tuple of (nmin, nmax)\n",
    "- `lowercase`: convert characters accordingly (default is True)\n",
    "- `min_df `/ max_df: ignore words that appear in less / more (int) or a smaller / larger share of documents (if float [0.0,1.0])\n",
    "- `max_features`: limit number of tokens in vocabulary accordingly\n",
    "- `binary`: set non-zero counts to 1 True"
   ]
  },
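  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The effect of these parameters on the vocabulary can be sketched on a tiny corpus. The documents and the `settings` dictionary below are hypothetical and serve only to illustrate the direction of each effect:\n",
    "\n",
    "```python\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "# hypothetical toy corpus for illustration only\n",
    "toy_docs = ['The cat sat on the mat',\n",
    "            'The dog sat on the log',\n",
    "            'A cat and a dog']\n",
    "\n",
    "# one keyword-argument set per configuration to compare\n",
    "settings = {\n",
    "    'default': {},\n",
    "    'stop_words': {'stop_words': 'english'},\n",
    "    'bigrams': {'ngram_range': (1, 2)},\n",
    "    'min_df=2': {'min_df': 2},\n",
    "    'max_features=3': {'max_features': 3},\n",
    "}\n",
    "\n",
    "# vocabulary size that results from each configuration\n",
    "vocab_size = {name: len(CountVectorizer(**kwargs).fit(toy_docs).vocabulary_)\n",
    "              for name, kwargs in settings.items()}\n",
    "print(vocab_size)\n",
    "\n",
    "# binary=True caps every count at 1, whereas the default keeps repeated occurrences\n",
    "max_count = CountVectorizer().fit_transform(toy_docs).max()\n",
    "max_binary = CountVectorizer(binary=True).fit_transform(toy_docs).max()\n",
    "print(max_count, max_binary)\n",
    "```\n",
    "\n",
    "Stop-word removal, `min_df`, and `max_features` shrink the vocabulary, while a wider `ngram_range` enlarges it."
   ]
  },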
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Key parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:16:31.267603Z",
     "start_time": "2020-06-20T17:16:31.265326Z"
    },
    "scrolled": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Convert a collection of text documents to a matrix of token counts\n",
      "\n",
      "    This implementation produces a sparse representation of the counts using\n",
      "    scipy.sparse.csr_matrix.\n",
      "\n",
      "    If you do not provide an a-priori dictionary and you do not use an analyzer\n",
      "    that does some kind of feature selection then the number of features will\n",
      "    be equal to the vocabulary size found by analyzing the data.\n",
      "\n",
      "    Read more in the :ref:`User Guide <text_feature_extraction>`.\n",
      "\n",
      "    Parameters\n",
      "    ----------\n",
      "    input : string {'filename', 'file', 'content'}, default='content'\n",
      "        If 'filename', the sequence passed as an argument to fit is\n",
      "        expected to be a list of filenames that need reading to fetch\n",
      "        the raw content to analyze.\n",
      "\n",
      "        If 'file', the sequence items must have a 'read' method (file-like\n",
      "        object) that is called to fetch the bytes in memory.\n",
      "\n",
      "        Otherwise the input is expected to be a sequence of items that\n",
      "        can be of type string or byte.\n",
      "\n",
      "    encoding : string, default='utf-8'\n",
      "        If bytes or files are given to analyze, this encoding is used to\n",
      "        decode.\n",
      "\n",
      "    decode_error : {'strict', 'ignore', 'replace'}, default='strict'\n",
      "        Instruction on what to do if a byte sequence is given to analyze that\n",
      "        contains characters not of the given `encoding`. By default, it is\n",
      "        'strict', meaning that a UnicodeDecodeError will be raised. Other\n",
      "        values are 'ignore' and 'replace'.\n",
      "\n",
      "    strip_accents : {'ascii', 'unicode'}, default=None\n",
      "        Remove accents and perform other character normalization\n",
      "        during the preprocessing step.\n",
      "        'ascii' is a fast method that only works on characters that have\n",
      "        an direct ASCII mapping.\n",
      "        'unicode' is a slightly slower method that works on any characters.\n",
      "        None (default) does nothing.\n",
      "\n",
      "        Both 'ascii' and 'unicode' use NFKD normalization from\n",
      "        :func:`unicodedata.normalize`.\n",
      "\n",
      "    lowercase : bool, default=True\n",
      "        Convert all characters to lowercase before tokenizing.\n",
      "\n",
      "    preprocessor : callable, default=None\n",
      "        Override the preprocessing (string transformation) stage while\n",
      "        preserving the tokenizing and n-grams generation steps.\n",
      "        Only applies if ``analyzer is not callable``.\n",
      "\n",
      "    tokenizer : callable, default=None\n",
      "        Override the string tokenization step while preserving the\n",
      "        preprocessing and n-grams generation steps.\n",
      "        Only applies if ``analyzer == 'word'``.\n",
      "\n",
      "    stop_words : string {'english'}, list, default=None\n",
      "        If 'english', a built-in stop word list for English is used.\n",
      "        There are several known issues with 'english' and you should\n",
      "        consider an alternative (see :ref:`stop_words`).\n",
      "\n",
      "        If a list, that list is assumed to contain stop words, all of which\n",
      "        will be removed from the resulting tokens.\n",
      "        Only applies if ``analyzer == 'word'``.\n",
      "\n",
      "        If None, no stop words will be used. max_df can be set to a value\n",
      "        in the range [0.7, 1.0) to automatically detect and filter stop\n",
      "        words based on intra corpus document frequency of terms.\n",
      "\n",
      "    token_pattern : string\n",
      "        Regular expression denoting what constitutes a \"token\", only used\n",
      "        if ``analyzer == 'word'``. The default regexp select tokens of 2\n",
      "        or more alphanumeric characters (punctuation is completely ignored\n",
      "        and always treated as a token separator).\n",
      "\n",
      "    ngram_range : tuple (min_n, max_n), default=(1, 1)\n",
      "        The lower and upper boundary of the range of n-values for different\n",
      "        word n-grams or char n-grams to be extracted. All values of n such\n",
      "        such that min_n <= n <= max_n will be used. For example an\n",
      "        ``ngram_range`` of ``(1, 1)`` means only unigrams, ``(1, 2)`` means\n",
      "        unigrams and bigrams, and ``(2, 2)`` means only bigrams.\n",
      "        Only applies if ``analyzer is not callable``.\n",
      "\n",
      "    analyzer : string, {'word', 'char', 'char_wb'} or callable,             default='word'\n",
      "        Whether the feature should be made of word n-gram or character\n",
      "        n-grams.\n",
      "        Option 'char_wb' creates character n-grams only from text inside\n",
      "        word boundaries; n-grams at the edges of words are padded with space.\n",
      "\n",
      "        If a callable is passed it is used to extract the sequence of features\n",
      "        out of the raw, unprocessed input.\n",
      "\n",
      "        .. versionchanged:: 0.21\n",
      "\n",
      "        Since v0.21, if ``input`` is ``filename`` or ``file``, the data is\n",
      "        first read from the file and then passed to the given callable\n",
      "        analyzer.\n",
      "\n",
      "    max_df : float in range [0.0, 1.0] or int, default=1.0\n",
      "        When building the vocabulary ignore terms that have a document\n",
      "        frequency strictly higher than the given threshold (corpus-specific\n",
      "        stop words).\n",
      "        If float, the parameter represents a proportion of documents, integer\n",
      "        absolute counts.\n",
      "        This parameter is ignored if vocabulary is not None.\n",
      "\n",
      "    min_df : float in range [0.0, 1.0] or int, default=1\n",
      "        When building the vocabulary ignore terms that have a document\n",
      "        frequency strictly lower than the given threshold. This value is also\n",
      "        called cut-off in the literature.\n",
      "        If float, the parameter represents a proportion of documents, integer\n",
      "        absolute counts.\n",
      "        This parameter is ignored if vocabulary is not None.\n",
      "\n",
      "    max_features : int, default=None\n",
      "        If not None, build a vocabulary that only consider the top\n",
      "        max_features ordered by term frequency across the corpus.\n",
      "\n",
      "        This parameter is ignored if vocabulary is not None.\n",
      "\n",
      "    vocabulary : Mapping or iterable, default=None\n",
      "        Either a Mapping (e.g., a dict) where keys are terms and values are\n",
      "        indices in the feature matrix, or an iterable over terms. If not\n",
      "        given, a vocabulary is determined from the input documents. Indices\n",
      "        in the mapping should not be repeated and should not have any gap\n",
      "        between 0 and the largest index.\n",
      "\n",
      "    binary : bool, default=False\n",
      "        If True, all non zero counts are set to 1. This is useful for discrete\n",
      "        probabilistic models that model binary events rather than integer\n",
      "        counts.\n",
      "\n",
      "    dtype : type, default=np.int64\n",
      "        Type of the matrix returned by fit_transform() or transform().\n",
      "\n",
      "    Attributes\n",
      "    ----------\n",
      "    vocabulary_ : dict\n",
      "        A mapping of terms to feature indices.\n",
      "\n",
      "    fixed_vocabulary_: boolean\n",
      "        True if a fixed vocabulary of term to indices mapping\n",
      "        is provided by the user\n",
      "\n",
      "    stop_words_ : set\n",
      "        Terms that were ignored because they either:\n",
      "\n",
      "          - occurred in too many documents (`max_df`)\n",
      "          - occurred in too few documents (`min_df`)\n",
      "          - were cut off by feature selection (`max_features`).\n",
      "\n",
      "        This is only available if no vocabulary was given.\n",
      "\n",
      "    Examples\n",
      "    --------\n",
      "    >>> from sklearn.feature_extraction.text import CountVectorizer\n",
      "    >>> corpus = [\n",
      "    ...     'This is the first document.',\n",
      "    ...     'This document is the second document.',\n",
      "    ...     'And this is the third one.',\n",
      "    ...     'Is this the first document?',\n",
      "    ... ]\n",
      "    >>> vectorizer = CountVectorizer()\n",
      "    >>> X = vectorizer.fit_transform(corpus)\n",
      "    >>> print(vectorizer.get_feature_names())\n",
      "    ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']\n",
      "    >>> print(X.toarray())\n",
      "    [[0 1 1 1 0 0 1 0 1]\n",
      "     [0 2 0 1 0 1 1 0 1]\n",
      "     [1 0 0 1 1 0 1 1 1]\n",
      "     [0 1 1 1 0 0 1 0 1]]\n",
      "    >>> vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))\n",
      "    >>> X2 = vectorizer2.fit_transform(corpus)\n",
      "    >>> print(vectorizer2.get_feature_names())\n",
      "    ['and this', 'document is', 'first document', 'is the', 'is this',\n",
      "    'second document', 'the first', 'the second', 'the third', 'third one',\n",
      "     'this document', 'this is', 'this the']\n",
      "     >>> print(X2.toarray())\n",
      "     [[0 0 1 1 0 0 1 0 0 0 0 1 0]\n",
      "     [0 1 0 1 0 1 0 1 0 0 1 0 0]\n",
      "     [1 0 0 1 0 0 0 0 1 1 0 1 0]\n",
      "     [0 0 1 0 1 0 1 0 0 0 0 0 1]]\n",
      "\n",
      "    See Also\n",
      "    --------\n",
      "    HashingVectorizer, TfidfVectorizer\n",
      "\n",
      "    Notes\n",
      "    -----\n",
      "    The ``stop_words_`` attribute can get large and increase the model size\n",
      "    when pickling. This attribute is provided only for introspection and can\n",
      "    be safely removed using delattr or set to None before pickling.\n",
      "    \n"
     ]
    }
   ],
   "source": [
    "print(CountVectorizer().__doc__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Document Frequency Distribution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:16:31.688693Z",
     "start_time": "2020-06-20T17:16:31.269315Z"
    }
   },
   "outputs": [],
   "source": [
    "binary_vectorizer = CountVectorizer(max_df=1.0,\n",
    "                                    min_df=1,\n",
    "                                    binary=True)\n",
    "\n",
    "binary_dtm = binary_vectorizer.fit_transform(docs.body)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:16:31.692350Z",
     "start_time": "2020-06-20T17:16:31.689957Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<2225x29275 sparse matrix of type '<class 'numpy.int64'>'\n",
       "\twith 445870 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "binary_dtm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:16:31.701870Z",
     "start_time": "2020-06-20T17:16:31.693172Z"
    }
   },
   "outputs": [],
   "source": [
    "n_docs, n_tokens = binary_dtm.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:16:31.720141Z",
     "start_time": "2020-06-20T17:16:31.702751Z"
    }
   },
   "outputs": [],
   "source": [
    "tokens_dtm = binary_vectorizer.get_feature_names()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### CountVectorizer skips certain tokens by default"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:16:31.777417Z",
     "start_time": "2020-06-20T17:16:31.721088Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['!', '\"', '\"\"unconscionable,', '\"'I', '\"'Oh', '\"'We', '\"'You', '\"(When',\n",
       "       '\"...it', '\"100%',\n",
       "       ...\n",
       "       'Â£900m', 'Â£910m).', 'Â£93.6bn)', 'Â£933m', 'Â£947m', 'Â£960m',\n",
       "       'Â£98)', 'Â£99', 'Â£9m', 'Â£9m,'],\n",
       "      dtype='object', length=47927)"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokens.index.difference(pd.Index(tokens_dtm))"
   ]
  },
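  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The difference arises because the default analyzer lowercases the text and keeps only tokens of two or more alphanumeric characters, treating punctuation as a separator. A minimal sketch on a made-up sentence (the `text` string is an assumption for illustration) contrasts this with a plain whitespace split:\n",
    "\n",
    "```python\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "# hypothetical sentence for illustration only\n",
    "text = 'I paid £9m (a fair price!)'\n",
    "\n",
    "# the analyzer bundles preprocessing, tokenization, and n-gram generation\n",
    "analyzer = CountVectorizer().build_analyzer()\n",
    "default_tokens = analyzer(text)\n",
    "whitespace_tokens = text.split()\n",
    "\n",
    "print(default_tokens)     # ['paid', '9m', 'fair', 'price']\n",
    "print(whitespace_tokens)  # ['I', 'paid', '£9m', '(a', 'fair', 'price!)']\n",
    "```\n",
    "\n",
    "Single-character words, currency symbols, and punctuation attached to words survive the whitespace split but not the default analyzer."
   ]
  },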
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Persist Result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "results_path = Path('results', 'bbc')\n",
    "if not results_path.exists():\n",
    "    results_path.mkdir(parents=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:16:31.794267Z",
     "start_time": "2020-06-20T17:16:31.778272Z"
    }
   },
   "outputs": [],
   "source": [
    "dtm_path = results_path / 'binary_dtm.npz'\n",
    "if not dtm_path.exists():\n",
    "    sparse.save_npz(dtm_path, binary_dtm)"
   ]
  },
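  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A sparse matrix stored with `sparse.save_npz()` can be restored with `sparse.load_npz()`. The following sketch round-trips a small, made-up matrix through a temporary directory (the names `toy_dtm` and `toy_dtm.npz` are assumptions for illustration):\n",
    "\n",
    "```python\n",
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "import numpy as np\n",
    "from scipy import sparse\n",
    "\n",
    "# hypothetical 2 x 3 document-term matrix\n",
    "toy_dtm = sparse.csr_matrix(np.array([[1, 0, 2],\n",
    "                                      [0, 3, 0]]))\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmp_dir:\n",
    "    path = Path(tmp_dir) / 'toy_dtm.npz'\n",
    "    sparse.save_npz(path, toy_dtm)\n",
    "    restored = sparse.load_npz(path)\n",
    "\n",
    "# the element-wise comparison has no non-zero entries if both matrices match\n",
    "same = (toy_dtm != restored).nnz == 0\n",
    "print(same, restored.shape)\n",
    "```"
   ]
  },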
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:16:31.807461Z",
     "start_time": "2020-06-20T17:16:31.795142Z"
    }
   },
   "outputs": [],
   "source": [
    "token_path = results_path / 'tokens.csv'\n",
    "if not token_path.exists():\n",
    "    pd.Series(tokens_dtm).to_csv(token_path, index=False)\n",
    "else:\n",
    "    tokens = pd.read_csv(token_path, header=None, squeeze=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:16:31.812806Z",
     "start_time": "2020-06-20T17:16:31.808310Z"
    }
   },
   "outputs": [],
   "source": [
    "doc_freq = pd.Series(np.array(binary_dtm.sum(axis=0)).squeeze()).div(n_docs)\n",
    "max_unique_tokens = np.array(binary_dtm.sum(axis=1)).squeeze().max()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `min_df` vs `max_df`: Interactive Visualization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The notebook contains an interactive visualization that explores the impact of the min_df and max_df settings on the size of the vocabulary. We read the articles into a DataFrame, set the CountVectorizer to produce binary flags and use all tokens, and call its .fit_transform() method to produce a document-term matrix:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The visualization shows that requiring tokens to appear in at least 1%  and less than 50% of documents restricts the vocabulary to around 10% of the almost 30K tokens. \n",
    "This leaves a mode of slightly over 100 unique tokens per document (left panel), and the right panel shows the document frequency histogram for the remaining tokens."
   ]
  },
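  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The filtering behind the slider can also be sketched without widgets. The corpus and thresholds below are hypothetical. The point is only that a boolean mask on the document frequencies selects the same tokens as passing `min_df` and `max_df` to the vectorizer:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "# hypothetical toy corpus for illustration only\n",
    "toy_docs = ['apple banana', 'apple cherry', 'apple banana date', 'apple']\n",
    "\n",
    "toy_dtm = CountVectorizer(binary=True).fit_transform(toy_docs)\n",
    "\n",
    "# share of documents that contain each token\n",
    "toy_doc_freq = np.asarray(toy_dtm.sum(axis=0)).ravel() / toy_dtm.shape[0]\n",
    "\n",
    "min_df, max_df = 0.5, 0.9  # illustrative thresholds\n",
    "keep = (toy_doc_freq >= min_df) & (toy_doc_freq <= max_df)\n",
    "print(f'{keep.sum()} of {keep.size} tokens remain')\n",
    "\n",
    "# the vectorizer applies the same thresholds when it builds the vocabulary\n",
    "filtered_vocab = sorted(CountVectorizer(min_df=min_df, max_df=max_df)\n",
    "                        .fit(toy_docs)\n",
    "                        .vocabulary_)\n",
    "print(filtered_vocab)\n",
    "```\n",
    "\n",
    "Here `apple` is dropped because it appears in every document, and `cherry` and `date` are dropped because each appears in only one of four."
   ]
  },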
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:16:32.563441Z",
     "start_time": "2020-06-20T17:16:31.813611Z"
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "45cfe0deb8c24b06b3b39a01dcf49657",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "interactive(children=(FloatRangeSlider(value=(0.0, 1.0), description='Doc. Freq.', layout=Layout(width='800px'…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df_range = FloatRangeSlider(value=[0.0, 1.0],\n",
    "                            min=0,\n",
    "                            max=1,\n",
    "                            step=0.0001,\n",
    "                            description='Doc. Freq.',\n",
    "                            disabled=False,\n",
    "                            continuous_update=True,\n",
    "                            orientation='horizontal',\n",
    "                            readout=True,\n",
    "                            readout_format='.1%',\n",
    "                            layout={'width': '800px'})\n",
    "\n",
    "@interact(df_range=df_range)\n",
    "def document_frequency_simulator(df_range):\n",
    "    min_df, max_df = df_range\n",
    "    keep = doc_freq.between(left=min_df, right=max_df)\n",
    "    left = keep.sum()\n",
    "\n",
    "    fig, axes = plt.subplots(ncols=2, figsize=(14, 6))\n",
    "\n",
    "    updated_dtm = binary_dtm.tocsc()[:, np.flatnonzero(keep)]\n",
    "    unique_tokens_per_doc = np.array(updated_dtm.sum(axis=1)).squeeze()\n",
    "    sns.distplot(unique_tokens_per_doc, ax=axes[0], kde=False, norm_hist=False)\n",
    "    axes[0].set_title('Unique Tokens per Doc')\n",
    "    axes[0].set_yscale('log')\n",
    "    axes[0].set_xlabel('# Unique Tokens')\n",
    "    axes[0].set_ylabel('# Documents (log scale)')\n",
    "    axes[0].set_xlim(0, max_unique_tokens)    \n",
    "    axes[0].yaxis.set_major_formatter(ScalarFormatter())\n",
    "\n",
    "    term_freq = pd.Series(np.array(updated_dtm.sum(axis=0)).squeeze())\n",
    "    sns.distplot(term_freq, ax=axes[1], kde=False, norm_hist=False)\n",
    "    axes[1].set_title('Document Frequency')\n",
    "    axes[1].set_ylabel('# Tokens')\n",
    "    axes[1].set_xlabel('# Documents')\n",
    "    axes[1].set_yscale('log')\n",
    "    axes[1].set_xlim(0, n_docs)\n",
    "    axes[1].yaxis.set_major_formatter(ScalarFormatter())\n",
    "\n",
    "    title = f'Document/Term Frequency Distribution | # Tokens: {left:,d} ({left/n_tokens:.2%})'\n",
    "    fig.suptitle(title, fontsize=14)\n",
    "    sns.despine()\n",
    "    fig.tight_layout()\n",
    "    fig.subplots_adjust(top=.9)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Most similar documents"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The CountVectorizer result lets us find the most similar documents using the `pdist()` function for pairwise distances provided by the `scipy.spatial.distance` module. \n",
    "\n",
    "It returns a  condensed distance matrix with entries corresponding to the upper triangle of a square matrix. \n",
    "\n",
    "We use `np.triu_indices()` to translate the index that minimizes the distance to the row and column indices that in turn correspond to the closest token vectors. "
   ]
  },
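  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The index translation is easiest to verify on a small example. The sketch below uses four made-up binary document vectors (the `toy` array is an assumption for illustration). Because `pdist()` omits the diagonal, the matching index arrays come from `np.triu_indices()` with `k=1`:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from scipy.spatial.distance import pdist, squareform\n",
    "\n",
    "# hypothetical binary document vectors (rows = documents, columns = tokens)\n",
    "toy = np.array([[1, 1, 0, 0],\n",
    "                [0, 0, 1, 1],\n",
    "                [1, 0, 1, 0],\n",
    "                [0, 1, 1, 1]])\n",
    "\n",
    "# condensed distances in the order (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)\n",
    "condensed = pdist(toy, metric='cosine')\n",
    "closest_pair = np.argmin(condensed)\n",
    "\n",
    "# k=1 skips the diagonal so that positions line up with the condensed vector\n",
    "toy_rows, toy_cols = np.triu_indices(len(toy), k=1)\n",
    "i, j = toy_rows[closest_pair], toy_cols[closest_pair]\n",
    "print(i, j)\n",
    "\n",
    "# cross-check against the full square distance matrix\n",
    "square = squareform(condensed)\n",
    "print(np.isclose(square[i, j], condensed.min()))\n",
    "```\n",
    "\n",
    "Documents 1 and 3 share two tokens and are the closest pair in this toy example."
   ]
  },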
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:17:12.679717Z",
     "start_time": "2020-06-20T17:16:32.564306Z"
    }
   },
   "outputs": [],
   "source": [
    "m = binary_dtm.todense()\n",
    "pairwise_distances = pdist(m, metric='cosine')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:17:12.684227Z",
     "start_time": "2020-06-20T17:17:12.680626Z"
    }
   },
   "outputs": [],
   "source": [
    "closest = np.argmin(pairwise_distances)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:17:12.714815Z",
     "start_time": "2020-06-20T17:17:12.685219Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(6, 245)"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rows, cols = np.triu_indices(n_docs)\n",
    "rows[closest], cols[closest]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:17:12.719802Z",
     "start_time": "2020-06-20T17:17:12.715674Z"
    }
   },
   "outputs": [],
   "source": [
    "docs.iloc[6].to_frame(6).join(docs.iloc[245].to_frame(245)).to_csv(results_path / 'most_similar.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:17:12.732487Z",
     "start_time": "2020-06-20T17:17:12.720618Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "topic                                               business\n",
       "heading                     Jobs growth still slow in the US\n",
       "body       The US created fewer jobs than expected in Jan...\n",
       "Name: 6, dtype: object"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs.iloc[6]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:17:12.743194Z",
     "start_time": "2020-06-20T17:17:12.733652Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    28972\n",
       "1      265\n",
       "2       38\n",
       "dtype: int64"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.DataFrame(binary_dtm[[6, 245], :].todense()).sum(0).value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Baseline document-term matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:17:13.148927Z",
     "start_time": "2020-06-20T17:17:12.744120Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<2225x29275 sparse matrix of type '<class 'numpy.int64'>'\n",
       "\twith 445870 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Baseline: number of unique tokens\n",
    "vectorizer = CountVectorizer() # default: binary=False\n",
    "doc_term_matrix = vectorizer.fit_transform(docs.body)\n",
    "doc_term_matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:17:13.152124Z",
     "start_time": "2020-06-20T17:17:13.149757Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2225, 29275)"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc_term_matrix.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Inspect tokens"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:17:13.188543Z",
     "start_time": "2020-06-20T17:17:13.153164Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['00',\n",
       " '000',\n",
       " '0001',\n",
       " '000bn',\n",
       " '000m',\n",
       " '000s',\n",
       " '000th',\n",
       " '001',\n",
       " '001and',\n",
       " '001st']"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# vectorizer keeps words\n",
    "words = vectorizer.get_feature_names()\n",
    "words[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Inspect doc-term matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:17:21.795737Z",
     "start_time": "2020-06-20T17:17:13.191151Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>00</th>\n",
       "      <th>000</th>\n",
       "      <th>0001</th>\n",
       "      <th>000bn</th>\n",
       "      <th>000m</th>\n",
       "      <th>000s</th>\n",
       "      <th>000th</th>\n",
       "      <th>001</th>\n",
       "      <th>001and</th>\n",
       "      <th>001st</th>\n",
       "      <th>...</th>\n",
       "      <th>zooms</th>\n",
       "      <th>zooropa</th>\n",
       "      <th>zornotza</th>\n",
       "      <th>zorro</th>\n",
       "      <th>zubair</th>\n",
       "      <th>zuluaga</th>\n",
       "      <th>zurich</th>\n",
       "      <th>zutons</th>\n",
       "      <th>zvonareva</th>\n",
       "      <th>zvyagintsev</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 29275 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   00  000  0001  000bn  000m  000s  000th  001  001and  001st  ...  zooms  \\\n",
       "0   0    1     0      0     0     0      0    0       0      0  ...      0   \n",
       "1   0    0     0      0     0     0      0    0       0      0  ...      0   \n",
       "2   0    0     0      0     0     0      0    0       0      0  ...      0   \n",
       "3   0    1     0      0     0     0      0    0       0      0  ...      0   \n",
       "4   0    0     0      0     0     0      0    0       0      0  ...      0   \n",
       "\n",
       "   zooropa  zornotza  zorro  zubair  zuluaga  zurich  zutons  zvonareva  \\\n",
       "0        0         0      0       0        0       0       0          0   \n",
       "1        0         0      0       0        0       0       0          0   \n",
       "2        0         0      0       0        0       0       0          0   \n",
       "3        0         0      0       0        0       0       0          0   \n",
       "4        0         0      0       0        0       0       0          0   \n",
       "\n",
       "   zvyagintsev  \n",
       "0            0  \n",
       "1            0  \n",
       "2            0  \n",
       "3            0  \n",
       "4            0  \n",
       "\n",
       "[5 rows x 29275 columns]"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# from scipy compressed sparse row matrix to sparse DataFrame\n",
    "doc_term_matrix_df = pd.DataFrame.sparse.from_spmatrix(doc_term_matrix, columns=words)\n",
    "doc_term_matrix_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Most frequent terms"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:17:22.314567Z",
     "start_time": "2020-06-20T17:17:21.796620Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "the    52574\n",
       "to     24767\n",
       "of     19930\n",
       "and    18574\n",
       "in     17553\n",
       "dtype: int64"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_freq = doc_term_matrix_df.sum(axis=0).astype(int)\n",
    "word_freq.sort_values(ascending=False).head() "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Compute relative term frequency"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:17:22.724293Z",
     "start_time": "2020-06-20T17:17:22.315343Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2225, 29275)"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vectorizer = CountVectorizer(binary=True)\n",
    "doc_term_matrix = vectorizer.fit_transform(docs.body)\n",
    "doc_term_matrix.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:17:22.750166Z",
     "start_time": "2020-06-20T17:17:22.725438Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "the     1.000000\n",
       "to      0.995056\n",
       "of      0.991461\n",
       "and     0.991011\n",
       "in      0.990562\n",
       "for     0.930337\n",
       "on      0.906517\n",
       "is      0.862472\n",
       "it      0.858427\n",
       "said    0.848539\n",
       "dtype: float64"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "words = vectorizer.get_feature_names()\n",
    "word_freq = doc_term_matrix.sum(axis=0)\n",
    "\n",
    "# reduce to 1D array\n",
    "word_freq_1d = np.squeeze(np.asarray(word_freq))\n",
    "\n",
    "pd.Series(word_freq_1d, index=words).div(\n",
    "    docs.shape[0]).sort_values(ascending=False).head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Visualize Doc-Term Matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:17:58.856629Z",
     "start_time": "2020-06-20T17:17:22.751016Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAu4AAAIqCAYAAAB7ZM9oAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAgAElEQVR4nOzde3xU9Z3/8fckkBATEPCytUgUL/FSRUBAXYuu3cXYWlcFISQ4lCKy1UoREC9AgBa5efu1Bq8UaxuFhCKtoK4XEMHbIiAXoSALChXWVSxRmAAJIfP7w2UkQCZzOTPn+z3n9fQxj4eTGc58z/l+vt/v53znO+cEwuFwWAAAAACMluF2AQAAAAA0jcQdAAAAsACJOwAAAGABEncAAADAAiTuAAAAgAVI3AEAAAALpDxxr6+v17hx41RUVKRgMKht27al+iMBAAAAY6xZs0bBYPCov7/55pvq3bu3ioqKNGfOnCa30ywVhTvcwoULVVtbq8rKSq1evVpTp07VE088keqPBQAAAFw3Y8YMzZ8/Xzk5OQ3+fuDAAU2ZMkVz585VTk6OiouLddVVV+mkk05qdFspn3FfuXKlevToIUnq1KmT1q1bl+qPBAAAAIyQn5+vsrKyo/6+ZcsW5efn6/jjj1dWVpYuvvhirVixIuq2Uj7jHgqFlJeXF3memZmpuro6NWvW+Efvr4vvM9p0u0NVy6cnWkTX2FpuP6KuAO+ifQOxa6q9tEh5Zpm4nM53OL7NZ+/tocrKysjzoqIiFRUVNXhPYWGhtm/fftS/DYVCatmyZeR5bm6uQqFQ1M9L+Yx7Xl6eqqurI8/r6+ujJu2JsLXDtbXcfkRdfdtZO/k+2M2Eeo6lDLG8h/YNIFFFRUWaN29e5HFk0h7NkTlydXV1g0T+WFKeuHfp0kVLly6VJK1evVoFBQWp/kgAKRBrckMS5A8m1HMsZTChnG4z4SQL3mF1mwpkOP9Iwplnnqlt27bp66+/Vm1trVasWKHOnTtH/Tcp/0KjZ8+eevfdd9WvXz+Fw2FNnjw51R8JAAD+j9WJFuBBCxYs0N69e1VUVKR7771Xt9xyi8LhsHr37q1/+qd/ivpvA+FwOJymcsYs3jXusB9rTGEi4tJ+1CHQNCfbidFr3C8e5vg29638nePbjIYbMMEIDKze4aWv5YlLu3kpFpEcYiG6ePo6jqW7SNwBH6HDRaqYGFtVy6en9OTLxH3GsXESDknGrXFPaBdYKpN+fHULAAC8yOilMl2HO77NfSv+n+PbjIYZdxeQtAOxY1YTsB/tGEYIBJx/pJmxiTuNHGjIr23ClBNdvx7/WJl6fFJZLlP32USmtGMkz+q498BSGWMTdxo50BBtwl2mHX/TBs9kj0+q9se0egOAZBibuMMZpg3usBexZBavJaQ27k/V8um0C/iOjW01gqUyqUNn6AyrGxgaiNYm0tFe3Iol+gIzUS/foo+NjYnxYmKZbMBxcxdXlQEAAEBMmroyntFXlbn0Hse3ue+/pjm+zWiMnXEH0JDbM+5u8fK+2cyWerGlnADSwANLZZhxR0y49jxgFtokABMZPeN+2b2Ob3Pf+1Md32Y0zLgjJiQIgFlok0Ds+OYFkrgcJAAAgOk40XUOJ0HuSipxX7NmjYLBoCRp27ZtKi4uVklJicaPH6/6+npJ0pIlS9S3b1/17dtXEyZMkIErcyQRiADgRfTt8LNUxL/VJ0EeWOOecOI+Y8YMjR07VjU1NZKkKVOm6M4779SsWbMUDoe1aNEihUIhPfjgg3ryySc1Z84ctWvXTlVVVY4V3klWByIA4JhS3bdzYgCTpSL+rY55Py+Vyc/PV1lZWeT5+vXr1b17d0nSFVdcoffee0+rVq1SQUGBpk2bppKSEp144olq27Zt8qUGAMAAXp30sTo5MwDHD6mS8G9/CwsLtX379sjzcDiswP99ZZCbm6s9e/aoqqpK
y5Yt01//+lcdd9xx6t+/vzp16qQOHTokX3IAnsTVUgDYjj7MUC4sbXGaY3P8GRnfbaq6ulqtWrVS69atdeGFF+qkk05Sbm6uunbtqg0bNjj1kQA8iAEPSJxTM720Q8BMjiXu559/vpYtWyZJWrp0qbp27aoLLrhAmzZt0q5du1RXV6c1a9borLPOcuojAQCIyk9LFkz+tupY9eCnuvEK6+vMz2vcj3TPPfeorKxMRUVFOnDggAoLC9W2bVuNHDlSgwcPVt++fdWzZ08VFBQ49ZFHsS2gbCsvANjG1EQ2FUze12OVzeTy4tioM/cZe+fUeGYOTJ5lAAAATWtqLGesN0NT9WD0nVOv/I3j29y3ZJzj24zG2BswxdM4aciNY1YfQDr5oc/xwz66oamxnLEeScsIOP9I9y6k/RORVnR0ANLJtD7H7RvQkOQDcBKJOwDAs9w+kXD7823BCY49rI5pfpwKAACQHKuTQSCNPJG4c6YOAP5D3w/TeD0mrd+/QMD5R5p5InG39Uzd+gaAtCJeYDO315o7gTaIptiaj8Sqavl0u9sBS2WQDK83cDjLlnixulMHorClDcYjne2VvgFIHok7AEd5MbkBvCqd7ZW+wVy+OaliqQwAAE0jaQPMRfu0B4k7AEfZMnNjSzkBJIY2nhpWJ/mscQdwOAYKezp1W8pJTJmN+jGXLW0cacRSGTPQccIUDBRwGjFlNurHHuQKzuA4ussTiXuqO06CFADgJMaV9OMkC75dKnPgwAGNGjVKJSUluummm7Ro0aLIa5MnT9bs2bMjz5999ln16dNHffr00fTpdjYaGjsANGRL4mlqORlXACQiocR9/vz5at26tWbNmqUZM2Zo4sSJ2rVrlwYPHqw333wz8r7PPvtM8+fPV0VFhSorK/XOO+9o48aNjhX+EFM7ZsCPaI/+EG/i6VZckCAnJhX1Rd8A1/l1jfs111yjYcOGRZ5nZmaqurpaQ4cO1fXXXx/5+/e+9z39/ve/V2ZmpjIyMlRXV6fs7OzkS30EOmbAHLRHf4g3CSMu7JKK+kp2myT+ZqAtuyuhxD03N1d5eXkKhUL61a9+pTvvvFPt27fXRRdd1OB9zZs3V9u2bRUOhzVt2jSdf/756tChgyMFPxYaNQCkB4M30s2Wb3m8zurj6tc17pL0+eefa8CAAbr++ut13XXXNfq+mpoa3XXXXaqurtb48eMT/biYMJAAAACJnCBVrD6uHlgq0yyRf/TVV19p0KBBGjdunC677LJG3xcOh3X77bfrkksu0ZAhQxIuJAAAAOB3Cc24P/nkk9q9e7cef/xxBYNBBYNB7d+//6j3LVy4UB988IHefvvtyPtWrVoV02dY/VUMAAAexfgMa3lgqUwgHA6H0/6pTdhf53YJAACAV7TpdofdSzws0iKhtRzpkfNT52Ng30vpPZH1xA2YvIJZDHiBLXFsSzmRHOo5PdJxnJP5DJJ2SPLEjLu1ibsXO2M6FniBLZd8o70BzklHe0p3m/VinuEEq4+LB36cam3izqALeBNtG04inpAoYgcmsjJxt/psD0gQcZ9ath1f28oLwBusPqFhqYw7rA4awGOOTCBtTSht61fcLq+t9Qz/IEZTg+PqLisTd68g+BEPtxO1xhxZLlvWuCM53MUyfhyD9GoqRqmPxJg6FsWENe5IhtXBj7RL1SBj2uBFu/Am6tXfx8C0fkbyd30kw8S6jBlLZQCkS6oGGQYvAKlGPwM4g8Q9BlafXQJACtjSL9pSTgBp4IGlMtw59Ri4wxoAAED8jL5zaq+Zjm9z37xbHN9mNMy4HwNJOwAAzuGbD5ggEAg4/kg3EncAgGf5LWE0dX+ZEAOckVDifvDgQd13333q16+f+vfvr7///e+R1yZPnqzZs2dHns+cOVO9evVS79699cYbbyRfYgCAsQka3EWCDDTOtzPuixcvliRVVFToV7/6laZMmaJdu3Zp8ODBevPNNyPv2717t8rLy1VRUaFnnnlG
kydPjvkzGJQAoHEkaLHhOAGICKTgkWYJJe7/9m//pokTJ0qS/ud//kcnnniiqqurNXToUF1//fWR9+Xk5Oj73/++9u3bp3379sV1ZkJnCwDpxYQJTJaK+CTm48cxc1fCa9ybNWume+65RxMnTlRhYaHat2+viy666Kj3nXLKKbr22mt14403asCAAUkVtjEEEQAkz60JE/pwxCIV8ckkob/4dqnMIdOmTdNrr72m0tJS7d2796jXly5dqi+//FKLFi3SW2+9pYULF2rt2rXJfOQx0fAAM5CA2cWU+kplH27KPiYjnn2wfX9TWX7bjw0gJZi4//Wvf9VTTz0l6dvlMIFAQJmZmUe97/jjj1eLFi2UlZWl7OxstWzZUrt3706uxACMxUm0XfxQX17YRy/sQ6xSua9+Oo6pVLV8urUnQV6YcU/oMvlXX3217rvvPvXv3191dXUaPXq0srOzj3pf165d9d5776lv377KyMhQly5ddPnllyddaAAAcDSSUxzJqZtKHkrWq5ZPtzbO3Ei0ncadUwE4ijsPwy+IdfjN4cl7Y0y+c2qrfn9yfJu7K1Lz+83GcAMmAI4ikYFJUvmVfmOxbusyAqAptvfvXlgqQ+IO+BxJBrzMjUTD9uQGiIb4dheJO2CodCXUdMJAenCSDNu16XaH3XHsgRswscYdAAAAjjB5jXvr/s85vs2vn7/Z8W1Gw4w7AADAYayeVU4xjo27rE3cCRzALLRJO6Wr3tyKD+ISiWAJYeNsPjb8ONVFNgcO4EW0STulq97cio9035WVEwV4HTHuLmsTd/iPm50FHRWAIx3rpIATWHiZ7WMhM+5Iiu0NIN3cHBAZjAE7HNmvxtvP0i/7E/UeO8ZDd5G4u4jgRzxsGVhsKScSZ/Il4ZLtV+mX/Yl6j03V8unGtv1YMONuiMaCyObgAo5ky8BiSzmRuKrl06lnwKesbvseuI67JxL3xoLIqeDiBAAAGqJfRDxMj5dUls/0fY+X1/bHNtyACQAg6dsB2erZNAApFUsfYfINmE4cWOH4Nr96tp/j24wmqRn3f/zjH7ryyiu1ZcsWbdiwQSUlJQoGg7rlllv01VdfRd5XX1+vwYMHa/bs2UkXGDARMxDwApJ2mII+1Uy2r3H3goQT9wMHDmjcuHFq0aKFJGnSpEkqLS1VeXm5evbsqRkzZkTe+9vf/lbffPNN8qUFDEXCAwDOoU81l8114+sfp06bNk39+vXTySefLEl65JFHdN5550mSDh48qOzsbEnSq6++qkAgoCuuuMKB4gKAXZidAgAz+DZxnzdvntq2basePXpE/nYogf/www/13HPPaeDAgdq0aZNeeuklDRs2LO7PiGewc2JgZHAF/NUO0rWvNs9OOcmt2OKqYwC8JKEfp/bv3z9yprFhwwadfvrpeuKJJ7R8+XI98cQTevzxx9W+fXs98MADWr58uVq0aKEdO3aoefPmGjNmTJOz7/w4FQDgBH5wC6SXyT9OPfmWOY5v88uZfaO+Xl9frwkTJujjjz9WVlaW7r//fp122mmR1+fPn68//OEPysjIUO/evVVSUhJ1e0lfVSYYDGrChAlat26dKisr9fjjj6t169ZHva+srEwnnniiiouLm9ym24k7HT0AAED8SNwbev311/Xmm29q6tSpWr16tZ566ik98cQTkdd/+MMf6qWXXtJxxx2na6+9VnPnztXxxx/f6PYcuY57fX29Jk2apOrqag0dOlTBYFCPPvpoUtt082tMknYAAICj2bzMzI017itXrowsLe/UqZPWrVvX4PVzzjlHe/bsUW1trcLhcJPbTPq8qLy8XJL0wQcfRH3f0KFD49ouyTMAIFl8gwo4y+b2lIofk1ZWVqqysjLyvKioSEVFRZHnoVBIeXl5keeZmZmqq6tTs2bfpuBnn322evfurZycHPXs2VOtWrWK+nkGf6EBAEBybE4yvIQTKO+gLhs6MlE/Ul5enqqrqyPP6+vrI0n7xo0b9dZb
b2nRokU67rjjNGrUKP3nf/6nfvzjHze6PUeWygAAADSGRM87bK5LN5bKdOnSRUuXLpUkrV69WgUFBZHXWrZsqRYtWig7O1uZmZlq27atdu/eHXV7nk7cbV6HBSC16B8AIH70nfHp2bOnsrKy1K9fP02ZMkX33XefFixYoMrKSrVr105FRUUqKSlRcXGx9uzZoxtvvDHq9pK+qkwquH1VGQBAdHxdDpvEG6/Ed+OaOjYmX1Xm+/8xz/Ft/s9TvRzfZjSemnHnLBAA0oOkBjaJN16Jb5jKU4n7kQ2NRB4A/I1xAHCW1Sc1gRQ80szgLzSSZ3VwAQAAwDGpuBxkunlqxh0AgMP5bQKHbxiQasSYu0jcgSb4vZNK9/77/XgjdsTK0fx2ohIN8YEjuXE5SMf3gavKAAC8iquDwGRejE+Trypz6u1/dXyb2x+/wfFtRmPw4QUAIDleS4rgLcRnenlhjTuJOwAAALzP/rydNe4AGscaUdiOGIbXEeP+kvCM+w033KCWLVtKkk499VTdcsstKi0tVTgc1rnnnqvS0lJlZmZqzpw5qqioULNmzXTbbbfpqquucqzwiI0X19AhPYgbADBbOvvpQycJto4Nvl0qU1NTI0kqLy+P/O3222/XiBEj1K1bN917771688031alTJ5WXl+uFF15QTU2NSkpKdPnllysrK8uZ0iMmtjYwAEgW/Z93ODkJxYRW7I48Vhw3dyW0VGbjxo3at2+fBg0apAEDBmj16tUqKytTt27dVFtbq507d+qEE07Q2rVr1blzZ2VlZally5bKz8/Xxo0bnd4HAACOiWUE3uFkwkjyGTsv3ZXeC5eDTGjGvUWLFrrlllvUp08fbd26VbfeeqteffVV7dixQz//+c+Vl5enDh066LPPPossp5Gk3NxchUIhxwoPAIlixs0fqGPAOba3Jy8slUloxr1Dhw7693//dwUCAXXo0EGtW7fWzp071a5dO73++usqLi7W1KlTlZeXp+rq6si/q66ubpDIR2PzGR3gFV5uh7YPQAAA/0kocZ87d66mTp0qSfriiy8UCoU0btw4bd26VdK3M+sZGRnq2LGjVq5cqZqaGu3Zs0dbtmxRQUFBTJ/BoArgcF4+iUDqEDcADvHtUpmbbrpJ9913n4qLixUIBDR58mRJ0r333qvmzZsrJydH999/v0466SQFg0GVlJQoHA5r+PDhys7OdnQHAHjboSUtTp/Ms1TGH6hjwFn0ne4KhMPhsNuFONL+OrdLALjLlI7RlHIAiSKGgfRqYfCtPTsMf9nxbX76/651fJvRcAMmWMcPX32TaABH80PbNwXHGocQC2YhcYd1SGoBf0qk7dNfAMk51Ia8kMB7YY07iXsMvBCsAAAgPbyaN7TpdofV+0binkI2BwZgk2htjdlKwJ9o+8nx4vE7dJEAL+6bTYxN3E0KDJPKAjgtWnxzAg0A8IpAwPlHuhmbuAN+dqyE2W9JtN/2FwBMR7/sPk8k7gQSvOZYs+B+++bn8P2ljSNRxE5iOG445PBYsH0cYo27IWwPJADR0caRKGInMRw3HOKtq8qwVMYIXggmwEQM3nAK/bS/UN/ew3hgBmMT93gavZfOBhvj5X3zI1vq05ZywnwM+v5CfXvH4eMAl4N0f6lMIBwOh9P+qU3YX+d2CQBIzt0untvOA4A/tGjmdgkad849rzm+zY+nFTq+zWiMnXG3SSxnnzafocK/nEq2SdpBHwhiwH1N1YHX68gLa9w9MePOrCAA4Fjo14H0MnnG/fzRrzu+zb9NvtrxbUaTcOL+1FNP6c0339SBAwdUXFyszp07q7S0VOFwWOeee65KS0uVmZmpJUuW6LHHHpMknX/++Ro/fnyTa4JYKgMAAGCepk6GSdxTK6GlMsuWLdOqVas0e/ZslZeX63//93/1yCOPaMSIEaqoqND+/fv15ptvKhQK6cEHH9STTz6pOXPmqF27dqqqqnJ6HwAAgMG8vgTDT2z+BssLS2US
Oi965513VFBQoF/+8pcKhUK6++67dfvttyszM1O1tbXauXOnTjjhBK1atUoFBQWaNm2aPvvsM/Xp00dt27Z1eh8AwBNY1gGvIq4BZyQ0415VVaV169bpd7/7nX7961/rrrvuUkZGhnbs2KGf/vSnqqqqUocOHVRVVaVly5bprrvu0owZM/THP/5Rn376aUyfwdk54C206aaR3CBZqW5niW6f9g8TeOFykAkl7q1bt9YPf/hDZWVl6YwzzlB2drZ27dqldu3a6fXXX1dxcbGmTp2q1q1b68ILL9RJJ52k3Nxcde3aVRs2bIjpMxjAAPtEG5xp03CD3xLGVLcz2jH81qZMk1DifvHFF+vtt99WOBzWF198oX379mnMmDHaunWrJCk3N1cZGRm64IILtGnTJu3atUt1dXVas2aNzjrrLCfLD8AgVcun06nDKCSaZohWD7H2GfH2LfRFqWFzm/LCGveEryrzwAMPaNmyZQqHwxo+fLhyc3P1wAMPqHnz5srJydH999+vk08+WS+//LJmzpwpSbrmmms0ZMiQJre9v461ngAQL/rNo3FMEsNxw7EcOhmy9aoyHcctdHyba3/zb45vMxpjr+NOp/EtjgMAJI4+FEhOvG2IxD21jL1zKh3ttzgOsAlfTQM4FvoGex2Zh9hcl174caqxM+4AAACwi8kz7heNX+T4Ntf8+l8d32Y0xs64O8Hms0IAQPIYBwBn2dymvPDjVGMTdycCg2UmAOBvjANAQ8nmVza3KS8slTE2cQcAAICzbE68YXDiTmA5z+avtwAAgPtsziVYKgOrcDKEI9ncASO1vBIbXtkPwBTkEu4icQd8rKkOmKTHvxic/Y22Dy9ijTsAT6taPp0BHPAhTtzgRSyVAeB5DOCwGfELOOPQJA6TOe4icQcAxI3BG04insx36CTY5pNhlsoA8DQGUzTGlsGbGLaDLfEEuC2hxH3evHkKBoMKBoPq27evLrzwQu3evVuStGDBAhUVFUXe++yzz6pPnz7q06ePpk+nYQKxINkAAJjg8PHI9rHJt2vce/XqpfLycpWXl+sHP/iBxo4dq1atWmnDhg2aO3euwuGwJOmzzz7T/PnzVVFRocrKSr3zzjvauHGjozsA2Kipzu9Ys09udJiJzILZ3rEjNYgLpBPx1rh4j83h44Dt34z4fqnMRx99pM2bN6uoqEhVVVV66KGHNHr06Mjr3/ve9/T73/9emZmZysjIUF1dnbKzs5MuNGC7RDo/NzrMRAY/2zt2pAZxgXQi3hqX6LHhx6lmaJbMP37qqaf0y1/+UgcPHtSYMWM0evToBol58+bN1bZtW4XDYT3wwAM6//zz1aFDh6QLDQBALEjgAGd4oS25sbTFaQnPuO/evVuffPKJLr30Uq1fv17btm3ThAkTNGLECG3evFmTJk2SJNXU1Oiuu+5SdXW1xo8f71jBAaSeFzpqAAC8IuHEffny5frnf/5nSVLHjh318ssvq7y8XI888ojOOussjRkzRuFwWLfffrvOOecc/eY3v1FmZqZjBQeQenwlmloc39Rz+xi7/flwh5fr3eYJHS+scU94qcynn36qU089Nep7Fi5cqA8++EC1tbV6++23JUkjRoxQ586dE/1YAGlkcwdtA46v91HH/kS9I1USnnEfPHiwBg4ceNTfTz31VM2ZM0eS1LNnT3300UeRK9CUl5dbl7S36XZHk2fOXj6zRmoQM/ahzszmt/rx2/46LdXHz+T6SSaniSUnMpkXLgcZCB+6dqNB9tfF9/423e7g7BYpQWwhnYg3ALZrkdRlT1Lr8gffdnyb747q4fg2o/HEnVMZ6JAqxBbSiXjzJ5tnMAGklycSd7c53enSiQPeQFt2nw11wAkbkB5eWCpD4u4ApztdW260A9gsHTFPQvYd+hi4jRiEF5C4O8ALnQEJBvzGhJj3Qt8RK7eOtwn1DDNinVhwhgl1mSgvXA6SxN0BdAYAEkHfAVM5nZwR6/Y6MhZsrksS
dwAA4Dk2J2dw1qFYOJTA2zzj7gUk7gAalUgHTacOAN5zKIG3+aSOH6cCwBFs7tThDZw8AvAqEvcEMTCYjfoB/Ovwk0f6AgCHsMY9hUzvbJlVNBv1AwDeEk9e4FYOYXrukow23e6wfv9YKgPA0zgBgu2IYX/i8qPOq1o+XVXLp1ufvNvO2MTdy8EPeBmdOoBUIC8wg8314NulMgcOHNDIkSPVr18/lZSUaMuWLZHXFixYoKKiosjzmTNnqlevXurdu7feeOON5EsMIGWOTLoTScJt7tQBANExOeOuhBL3JUuWqK6uThUVFfrlL3+p3/72t5KkDRs2aO7cuQqHw5Kk3bt3q7y8XBUVFXrmmWc0efJk50oOwHEk3fAakozEcNxwLG263WH1OOHbNe4dOnTQwYMHVV9fr1AopGbNmqmqqkoPPfSQRo8eHXlfTk6Ovv/972vfvn3at29fXF8p0GkA3kF7BgD72Zy0e0WzRP7Rcccdpx07dujHP/6xqqqq9MQTT2jMmDEaPXq0srOzG7z3lFNO0bXXXquDBw/qP/7jP2L+DIID8A7aMwDAbRluTJE7LKEZ92effVY//OEP9dprr+nFF19USUmJNm3apAkTJmjEiBHavHmzJk2apKVLl+rLL7/UokWL9NZbb2nhwoVau3at0/vAbB4A4Jg4aUxMuo/boXE82njOWG8Gm+vBt0tlWrVqpZYtW0qSjj/+eLVr107z589XeXm5HnnkEZ111lkaM2aMjj/+eLVo0UJZWVnKzs5Wy5YttXv37pg+w6TAMKksyYqlc4T7nKyfdNe1jbEVS5lt3K94OLF/bh6jxj7b6/WWKvEcz1j/FotoJwychMWPvs17AuFDvySNQ3V1tUaPHq2dO3fqwIEDGjBggK677jpJ0vbt2zVixAjNmTNHkvToo4/q7bffVkZGhrp06aK77767ybXu++uif360H0fY/sMJwCS0JyA+8bYZ29qYbeVF+rVIaBF2ehQ+vszxbb52+yWObzOahBL3VNtfZ1bnYFJZgHQyKfZNKgvckUgMEDdAepG4pxY3YIqBSWWBN5jy1aQp5YgF7dBuTsSajTFgUxs7nK3lRtOSrVubYyMj4PyjKfX19Ro3bpyKiooUDAa1bdu2BvIrxLoAACAASURBVK+vXbtWJSUlKi4u1q9+9SvV1NRE34dkDgCAxJiSgJhSDnifX29B7/bnA0dK5FurZP69Sdy4c+rChQtVW1uryspKjRw5UlOnTo28Fg6HVVpaqilTpmj27Nnq0aOHduzYEXV7xibuiZzR2XwWCACIjj4+fZxOzqg7ex2KhTbd7qAeE7By5Ur16NFDktSpUyetW7cu8tqnn36q1q1b649//KNuvvlmff311zrjjDOibs/YxD2RTsPms0D4S1OdH50jcLRE+njaUvxSccycGJ+pS3dVLZ+uquXTra6HVFwOsrKyUr169Yo8KisrG3xmKBRSXl5e5HlmZqbq6r69CktVVZVWrVqlkpIS/eEPf9B//dd/6f3334+6Dwb/hADwrqYGMVNOQk0phwn4kSOcZHI8US4cy6GEnXpoqKioSEVFRY2+npeXp+rq6sjz+vp6NWv2bfrdunVrnXbaaTrrrLMkST169NC6det02WWXNbo9Y2fcbWDzWSfgd/G2XwYrs9l2HXriCTZqbMbdlnwokIL/mtKlSxctXbpUkrR69WoVFBREXmvfvr2qq6sjP1hdsWKFzj777Oj7YOrlIAHELlWzdybPCgIA0q+pccHky0H++9PLHd/m/CHdor5eX1+vCRMmaNOmTQqHw5o8ebL+9re/ae/evSoqKtL777+vhx9+WOFwWJ07d9bYsWOjbs8TiTvJBYBjoW8AgPQicU8tTyyViXYXVcAviPej+S1pJwaOxjEBvmXbcrJUcONykI7vgxdm3AEAAJB6Ni+VuX7GCse3+eKtXR3fZjSemHEHAOBYbJ8h9ArqASZIxeUg043EHQB8wo/Jk9+WS0lm1rMf
6wFIBRL3NDGxIwXgL24mT/SB6WNbkkxs2MW2+DpcRiDg+CPt+5DIP6qtrdXIkSPVt29fDRo0SFu3btX69evVo0cPBYNBBYNBvfLKK5H319fXa/DgwZo9e7ZjBbeNzYEOAECqMD7axeYTLS8slUnoJwRz5szRcccdpzlz5uiTTz7RxIkTdc011+jnP/+5Bg0adNT7f/vb3+qbb75JurAAksclEuEnxDsAL0loxn3z5s264oorJElnnHGGtmzZonXr1umtt95S//79NXr0aIVCIUnSq6++qkAgEHk/YDubZxskZrfgDuLOLrb3c0hMU/Vue1x44XKQCSXu5513nhYvXqxwOKzVq1friy++0AUXXKC7775bzz//vNq3b6/HHntMmzZt0ksvvaRhw4Y5XW7ANSQggD1or4nhuPkT9W6+hJbK9O7dW1u2bNGAAQPUpUsX/eAHP1BhYaFatWolSerZs6cmTpyoQCCgL774Qj/72c+0Y8cONW/eXO3atWP2HYgBX/HDZMQnANu4sSbdaQndgGnVqlX68ssvVVhYqI8++kjPPPOMtm/frtLSUnXs2FHl5eX6/PPPdffdd0f+TVlZmU488UQVFxc3uX0bbsDEoAUAsaPP9Dfq3z9MvgFT0R9XOb7Nyp91dnyb0SS0VOa0007T7NmzVVRUpN/97ne69957NWHCBE2ePFnBYFAffvihbr/99qQKZvo6qlg7INP3AwDSwa2kjT7YDCTt5ki2TdCm3JXQjHuq2TDjDuDYmFkDEC/6DXs0VVcmz7j3S8GMe4UNM+4A0BgGX5iE2UE7+LXfcDs+E/l8v9aVKUjcAQCeRZIBk7kdn4l8vtsnG8nw7eUgYXfgAgAQDWNc6nBs3ZMRcP6R9n1I/0fGxvTAdvssGQDQNNPHElM5PcZRD98hf0AyjE3cCWwAgFf5LZFlTIcJWCoDwPP8lmAA6UAi2xD9TPpwrO1m8EV7AJiABAM2I37tQD2lj5+PtRfunMqMOwBHMZvjD9QzAKQfiTtcRwLgLX67vJhf2TJrR2wBOIQ17oADbEkAkDrEAFKF2LKL10+0TNy/eMpkYvnjweUgAQCIgVsDvu2Jht94/UTLxP2Lp0wmlt9vSNwBuCqWxIrky35uDfgkGmajbSOdfLNUZs2aNQoGg5Kkbdu2qbi4WCUlJRo/frzq6+slSUuWLFHfvn3Vt29fTZgwQeFwWPv379fQoUNVUlKiW2+9Vbt27UrdngCwUiyJlVPJF0kCYBZOrID4NJm4z5gxQ2PHjlVNTY0kacqUKbrzzjs1a9YshcNhLVq0SKFQSA8++KCefPJJzZkzR+3atVNVVZVmz56tgoICzZo1SzfccIMef/zxlO+Qn5GUoDHExrdIEoDYmNhnmFgmvzh07Nt0u8PqfjSQgke6NZm45+fnq6ysLPJ8/fr16t69uyTpiiuu0HvvvadVq1apoKBA06ZNU0lJiU488US1bdtWK1euVI8ePSLvff/991O0G5BIStA4YgNAPEzsM0wsk18cOvZVy6dbfQKVEQg4/ki3Jm/AVFhYqO3bt0eeh8PhyJqe3Nxc7dmzR1VVVVq2bJn++te/6rjjjlP//v3VqVMnhUIhtWzZssF7AQD2s33mDUD8bE7avSLuH6dmZHz3T6qrq9WqVSu1bt1aF154oU466STl5uaqa9eu2rBhg/Ly8lRdXd3gvV5GQKcHxxlO8GIcpXOfSNoB/7G93QcCzj/SLe7E/fzzz9eyZcskSUuXLlXXrl11wQUXaNOmTdq1a5fq6uq0Zs0anXXWWerSpYuWLFkSee/FF1/sbOlddKwB0vaAtgXHGU7wYhx5cZ+S5cUTtHTguHlXtLql3s0Xd+J+zz33qKysTEVFRTpw4IAKCwvVtm1bjRw5UoMHD1bfvn3Vs2dPFRQUqLi4WP/93/+t4uJiVVZW6o47vHORfwZIwF9M75NMx/GzS7rHOFviw5ZyJsrruY0XLgcZCIfD4bR/ahP217ldAgAA
kAr8PsJuTdVfiyZ/Peme/5i73vFtPnXTDxzfZjTcgAnW8fqMR2P8ut8AvIWk3W7Un7tI3GEdv3Yaft1vmIOTR+e4dSypQxwukXiwOYa8cDlIEnfAIDZ3iLbhWMfPxpNHU+vZrWNpYx0idYgH+5C4AwahE00fjrU/UM+JMfWEB0iGLy8HCQBAvEgE7cIJDxpDbLiLxD0GDDgwAXEImzHY24X+Bo2xOTa8cDlIEnfAEiQ+QPxsTjLclIr+hrowg5/rISMFj3QjcY8BCRMAAMlhLPUG6tFdJO4AAM8iyfAvP88sR5NMm2jT7Q6rjytLZZA0mxsAgNSjjwASw0lbbOLpYzim7iNxd1AiAyyNAEA09BGA95h0Qu6nPiYj4Pwj7fuQ/o/0Lj8FPwCkiklJzeFMLRfM01Ss2Jov2N4GSNx9wvZAhTcQh/ALU5MaU8uVCvQ3yfFqrHh1v2wSc+K+Zs0aBYPByPM33nhDI0eOjDzftm2bBg4cqP79++vnP/+5qqqqJEnz5s1Tnz591KtXLz322GMOFh3wFzpMmMSWxM6WcpqG/iY5xJ2ZvPDj1GaxvGnGjBmaP3++cnJyJEn333+/3nnnHZ133nmR95SWlmrEiBHq1KmTXnvtNW3dulV79uzR7NmzVV5erqysLD366KM6cOCAmjdvnpq9AQCkBYkdnNSm2x2eiikv7cvhOCFxX0wz7vn5+SorK4s879KliyZMmBB5vn//fu3atUuLFy9WMBjU6tWr1bFjR7333nu64IILdM899+jmm29Wly5drEzavdoAAcDr6L/tQD2ZIZa1+TbXlW/WuBcWFqpZs+8m53/yk580+Hrgm2++0X//93/rsssu05/+9Cd98803+stf/qKqqiqtWLFCkyZNUllZme6//37t3r3b+b2wAGepSBYxBMSPdgPELpak3OY2FQg4/0g3R36cevzxxys3N1eXXnqpAoGArrrqKq1bt06tW7dW9+7dlZeXpxNOOEFnnnmmtm7dGtM2TQoMJ8pi8xkqzOCXGDKp7cN+fmk3pqNdewP16D5HEvcWLVro9NNP14oVKyRJy5cv19lnn60uXbrogw8+UE1Njfbu3astW7YoPz8/pm2a1NmaVBZ4A51f42hv9iCOEaum2jWxZAfb++eMQMDxR7rF9OPUWEyePFm//vWvdfDgQZ166qm66667lJWVpd69e6u4uFjhcFi33367WrdundTnHGrctgcP/I34hRcQx3BKOmLJ5B/Amly2I9lSTq8KhMPhsNuFONL+OrdLALjPpo4c/kN8Ao3zavuIZb9aODYl7LzRr2xyfJuTf1Lg+Daj4QZMgKG82OkDgB+Y3H8nsyypavl0ljW5jMTdQjQaAG4zOTEB0Lhk267NbZ+rysAVNjcap/jx5MWP+wwky8l2QxsE7OaFH6eSuMfJqY6bASA5fjx5iWefiS/AeX7sd0xiU79mU1nj0abbHZ7dN1uQuMfJqY6bAQCpRHzBNAz2SJZN/ZpNZY2XzfvGUhkAAGLg1mBvc5IBmIYfp7qPxB1wAR0fAADplRFw/pH2fUj/RwJgFhB+w8kqYAY/t0V+nAogbfzc2QKwmxP9F30gQOIOWINZetiM+PWGRJNnJ+rfphgy+STDpuPoNH6cCgBADGJNZJxOeExOoGzk56QvHhwnpAqJO1zHwAp4X6yJDAmPGeiXk+PV42f7fvHjVMABDNQAYBb65eTYfPyiJec275ckBVLwX7rFlLivWbNGwWBQkrRhwwaVlJQoGAzqlltu0VdffaUNGzYoGAxGHhdeeKGWLl2qPXv26Be/+IVuvvlmFRUVadWqVSndGQBAetgy82Z7ogHn2BKzbqPNmK3JxH3GjBkaO3asampqJEmTJk1SaWmpysvL1bNnT82YMUPnnXeeysvLVV5erpKSEl199dW64oor9Ic//EGXXnqpnnvuOU2ZMkW/+c1vYi6YKQ3MqXKYsj9+wLEGUs+WwZ3+AIfYErOp1lSbaOz1Q3+3
+Tj6YqlMfn6+ysrKIs8feeQRnXfeeZKkgwcPKjs7O/La3r17VVZWpjFjxkiSBg4cqH79+h3zvU0xJTCcKocp++MHHOvoSGQSY+pxM7VcpqA/ABpqqk009vqhv9PnuKvJxL2wsFDNmjWLPD/55JMlSR9++KGee+45DRw4MPLa3Llzdc0116ht27aSpFatWqlFixbauXOnRo0apREjRsRcMAIDSI0jO2XaWmxMTQBNLZcpiG/AWTb3Ob6YcT+WV155RePHj9fTTz8dSdIlacGCBerTp0+D93788ccaOHCghg8fru7du8f8GTYHBpAKqUpAaGvwMuIbiC7esYWTYXfFnbi/+OKLeu6551ReXq727dtH/r5nzx7V1tbqlFNOifxt8+bNGjZsmB5++GFdeeWVzpTYUgQ6kkUCAgBI1pH5SDxji+25TCAQcPyRbnEl7gcPHtSkSZNUXV2toUOHKhgM6tFHH5Ukffrpp2rXrl2D9z/88MOqra3VpEmTFAwGddtttzlXcsuQdMH2Dg9IhlvxT7sDGkomH7E9l/HCUplAOBwOp/9jo9tf53YJAADwrzbd7rA+SYPzYrmyTItmjb7kuoeXfOL4NkdeeYbj24yGGzABlmDmELAf7RhwTyDg/CPdjE3c6dyAhmyZ/aLtAo2zpR3bUk7EL5k+umr5dGLDZcYm7gQGYKd42y6Jvj+wxh3wBpvbVEYg4Pgj7fuQ9k80lM2BCNiMk3R/oJ4BMyTbFm1uy174cSqJ+/+xORABeBMTCsmjbwfgpvr6eo0bN05FRUUKBoPatm3bMd9XWlqqhx56qMntkbjDGiQx8BuSzuTRbwA4xI0fpy5cuFC1tbWqrKzUyJEjNXXq1KPeU1FRoU2bNsW0DyTusAZJDAAAsMnKlSvVo0cPSVKnTp20bt26Bq+vWrVKa9asUVFRUUzbM/hqmwAAJIcTfgCHZMj5RemVlZWqrKyMPC8qKmqQhIdCIeXl5UWeZ2Zmqq6uTs2aNdOXX36p6dOna/r06frP//zPmD7PU4k7N4wAAABIDdvzrFRcBObIRP1IeXl5qq6ujjyvr69Xs2bfpt+vvvqqqqqqNGTIEO3cuVP79+/XGWecoV69ejW6PU8tlbE5mAAAzmONO+CcquXTaVNx6tKli5YuXSpJWr16tQoKCiKvDRgwQPPmzVN5ebmGDBmin/70p1GTdskjiTtBBADpZUu/y4QOkJwj27rNbcqNy0H27NlTWVlZ6tevn6ZMmaL77rtPCxYsaLC8Jh6BcDgcTuhfptD+OrdLAAAAgMPFslSmhcGLsJ98f6vj2/zFZac7vs1oPDHjfogtM0AAEA/6tsRx7ADn2L5Uxgt3To1pxn3NmjV66KGHVF5ers2bN6u0tFThcFjnnnuuSktLtWnTJk2ePDny/tWrV+uxxx5T586dNXz4cO3bt0/NmzfXgw8+qJNOOqnJQjHjDtjL9h8vITbUM4BjMXnGfcayY9/8KBm3XnKa49uMpskZ9xkzZmjs2LGqqamRJD3yyCMaMWKEKioqtH//fr355ps677zzVF5ervLycpWUlOjqq6/WFVdcoXnz5qmgoEDPP/+8fvKTn2jmzJkp2Qmbz/4AryGZg0kYH/zJlHo3pRyxilZe2/bFq5pM3PPz81VWVhZ5XlZWpm7duqm2tlY7d+7UCSecEHlt7969Kisr05gxYyRJBQUFkUvghEKhyOVvnEaiAADpZUu/a0s54SxT6t2UcsQqWnlt25dj8cJSmSYT98LCwgYJd2Zmpnbs2KGf/vSnqqqqUocOHSKvzZ07V9dcc43atm0rSWrTpo3efffdyGz7TTfdFHPBTDqzM6ks8C9b4tCWcsIfiMfExHvcTDzOJpbJCziu7oppjfv27ds1YsQIzZkzp8Hf//znP2vFihWaNm2aJKlPnz569NFHdcopp0iS7rjjDv3whz9Uv379tHHjRo0aNUoLFixoslCscQfgd6whB2Ajk9e4P7P8745vc1C3fMe3
GU3cV5X5xS9+oa1bt0qScnNzlZHx7Sb27Nmj2traSNIuSa1atVLLli0lSSeccEKDO0elgslngSaXDbGjHpEuhyftxB2AZNCHeEfc50VDhgzRvffeq+bNmysnJ0f333+/JOnTTz9Vu3btGrx32LBhGjt2rGbNmqW6ujpNnDjRmVI3wuTZKZPLhthRj3BCvLPpxB2AZNCHfMsL10DnBkwArMaSEkRDfCAZxM/RmjomJi+V+eOKzxzf5s+6tnd8m9F44eQDgI8xqAJIlUT7F5amIFVI3AEASDMSOzskWk9enlCwed8CKXikmycSdzpAHAtxAcBUNic/fkI9wTSeSNxpWDgW4gIA/YA/MXGTOjYfW1/cgAl2s7mBeVkq6oW69ibqNTkcPyQi2bjxatzZvl8slYHxmG0yUyrqhbr2pljq1fbBFEiFZNpFtHYXy3a92h97db9sQuIO65CkAA25MZjG2w5pt0hWvDGUqnbhteT1yOPa1HG2ef8DAecfad8HruMOAKnDdaDdxfEHnGXzddxnfbjd8W2WdDnV8W1Gw4w70oYZN/gRSSMAW3h9nA4EAo4/0r4PzLgDgLmYMQbiR7txj8kz7pWrdji+zaLO7RzfZjTMuANAHNI9I0XyAcSPdpM6Xp+VNx2JOwDEgYQAsSLBgRfZ3Ad6YalMTIn7mjVrFAwGG/xtwYIFKioqijyfOXOmevXqpd69e+uNN95o8N4tW7bo4osvVk1NjQNFBvyJJACIn5vtxuYEB8nxcn/t5X2zQZMrkWbMmKH58+crJycn8rcNGzZo7ty5OrQ8fvfu3SovL9frr7+uffv26YYbblDPnj0lSaFQSNOmTVNWVlaKdgHwh1QlAawFBQBneblPtXnf3LhhktOanHHPz89XWVlZ5HlVVZUeeughjR49OvK3nJwcff/739e+ffu0b9++yFcH4XBYpaWlGjFiRIPE3484Q4WpbO6E4R5b+jTiG4jOlraMbzWZuBcWFqpZs28n5g8ePKgxY8Zo9OjRys3NbfC+U045Rddee61uvPFGDRgwQJI0ffp0XXnllTr33HNTUPSjmRx8DB4AvIQ+DX5gcl7hFD+1Zd+scT9k/fr12rZtmyZMmKARI0Zo8+bNmjRpkpYuXaovv/xSixYt0ltvvaWFCxdq7dq1mj9/vl544QUFg0Ht3LlTgwYNStV+SPJX8CE+fuh8ARMk29acbqumtn1Ty4WG3MorTIoPk8qSrIwUPNItrqttduzYUS+//LIkafv27RoxYoTGjBmjFStWqEWLFsrKylIgEFDLli21e/fuBj9S/dGPfqRnnnnG2dIDMeKkDkiPZNua023V1LZvarlgBuIDjXHkZKFr16668MIL1bdvXxUVFen000/X5Zdf7sSmAfiIl2Z20BB1C9jpyJMIm9uyF5bKcOdUAADgKVwtyz0m3zn1L2v/1/Ft3tjxe45vMxpuwATAMTbNxNhUVgDxIWnHsQRS8Eg3EnfAEjYkmjYNljaVFYmzod0ASI9AwPlHupG4IyYMfu4j0XQXbQB+4vV4T/X+efX4tel2h2f3zRascQcAeBZrnf2JenePyWvcF3z0hePbvO7Cf3J8m9Ew4w4A8CySN7iBWWmkCok7AAAwXjzJsNsnbG5/Po6NNe4+xtk0AJiPvto74k2GqXscKZCC/9KNxD1BdCBIBvEAAKnlpRsHmYTj6C4S9zThazMcjniAxAAIpBP97rHF2w/ZfBxZKgMASJjNAyAAb6haPp1JBIuQuAON8EtH5vR++uW4AV7mxXbsxX1ySqyTCLZfxz1DAccf6cZ13AEAABzk1evIx7JfJl/H/dX1Ox3f5jU/OMnxbUbDjDsAR9k8GwMAaJzty2p8s8Z9zZo1CgaDkqT169erR48eCgaDCgaDeuWVVyRJTz/9tK6//nr1799fixcvliTt379fQ4cOVUlJiW699Vbt2rUrRbuBI9ncsLzO63XjxVkm2Mvr7Q1moh80kxcS9yaXysyYMUPz
589XTk6O5syZoz//+c/as2ePBg0aFHnPxx9/rFGjRunPf/6zJKlfv36aNWuWKioqFAqFNHToUL388statWqVxo4d22ShWCoDAN7i1aUDgG1S3RZNXirz+gbnl8pcfZ5hS2Xy8/NVVlYWeb5u3Tq99dZb6t+/v0aPHq1QKKQtW7aoe/fuys7OVnZ2tk477TR9/PHHWrlypXr06CFJuuKKK/T++++nbk+U3pkVZnEAIHYk7YAZkm2LNuc/vrgBU2FhoZo1++70qWPHjrr77rv1/PPPq3379nrsscd0zjnnaMWKFQqFQqqqqtKqVau0b98+hUIhtWzZUpKUm5urPXv2xFywRAIjnQMDgxBsZXOnCwCxcLufc/vz42FTWZHAj1N79uypCy64IPL/f/vb33TmmWeqf//+uvXWWzVt2jRddNFFatOmjfLy8lRdXS1Jqq6uVqtWrWL+HBJjIDVoW+6zaaC0qazwHlvjz6Z+1qayJisj4Pwj7fsQ7z+45ZZbtHbtWknS+++/rx/84AfatWuXqqqqNHv2bI0ZM0aff/65zj77bHXp0kVLliyRJC1dulQXX3yxs6UHAAvZNFDaVNZjsTXxA0xlc5/ghaUycf+EYMKECZo4caKaN2+uE088URMnTlRubq62b9+u3r17q3nz5rr77ruVmZmp4uJi3XPPPSouLlbz5s318MMPp2IfAAA4JreTDH6Um5xEjx3HPHWIaXdxAyYcEw3TPNQJ3GB73NleflMcOo4cTzTF5KvKLP74H45v86pzTnB8m9EYm7jTOQAA0LhkxknG2NTy8/ElcU8t7pwKAGiU7WvEbS9/NMkkhn5NKtPF1uPbVHuxvT15YY27sTPuAACkWjpnRv08Cwv/MHnGfemmXY5v84qCto5vMxpm3GNg+xkmACTK9v6vqfJz/w972B6LgBNI3GPgRGdLhwPARrYnm7aX3+8OHzupy9TwU37ihaUyJO5pQocDAEB8bB07TU6GjyxbvMfY5H3zAxJ3uM6PnYAf9xlwA23Nn9yud5O/qU+mbG4f12QFAs4/0o3EHa6zdUYlGV7aZ9s7csBP/NJevdDHmrgPJpbJb0jcASSFjhwmIz4bivV4mJLgJ1oOU8rvRTa3qUAKHulG4g7ASAy8cAJxlBhTkrPDy0FdIlkZgYDjj7TvQ9o/MYVo1ID7nGqHpiQOAJyRbN9AnwBwAyYAgIdx0yPAOYdOvqK1KZNvwPRfm792fJuXntXa8W1G46kZdwDe5NVv02zdL1vLbZJDx9DtY+n256NxJtZN1fLpnAi7LKYZ9zVr1uihhx5SeXm5/vGPf2js2LHavXu3Dh48qAceeED5+fl6+umn9fLLLysvL0+DBw/WVVddpYMHD2rKlClat26damtrNXToUF111VVNFooZdwAAAPsYPeO+JQUz7mcaNuM+Y8YMjR07VjU1NZKkBx98UNddd52ef/553Xnnnfrkk0/08ccf66WXXtKcOXP0zDPP6NFHH9W+ffv04osvqq6uThUVFXriiSe0bdu2mAtm4pkmAMAujCXf8ttx8Nv+povtx9UXd07Nz89XWVlZ5PmHH36oL774QgMHDtSCBQvUvXt3bdmyRd27d1d2drays7N12mmn6eOPP9Y777yj733vexoyZIjGjh2rH/3oRzEXjK9iADs51bHbPkAAJvHbmJrK/aVvgpuaTNwLCwvVrNl333vs2LFDrVq10rPPPqtTTjlFM2bM0DnnnKMVK1YoFAqpqqpKq1at0r59+1RVVaVt27bpqaee0q233qr77rsvpTsDwH1ODZh+SzQAOCeVybWf+ybb992Xd05t3bp1ZOb8Rz/6kdatW6czzzxT/fv316233qpp06bpoosuUps2bdS6dWv9y7/8iwKBgLp3766tW7c6XX4AgAWYpUQ62Z5gAo2JO3G/+OKLtWTJEknS8uXLddZZZ2nXrl2qqqrS7NmzNWbMGH3++ec6++yzG7x348aNOuWUU5wtPQDfIhG0
C4kU4A02972+vHPqPffcoxdffFH9+vXT22+/rV/84hdq06aNtm/frt69e+vWW2/V3XffrczMTPXtJqXhiAAAIABJREFU21fhcFh9+/ZVaWmpfv3rX6diHwD4EImgP9icJKSD14+P1/cPaeaBzJ0bMAEAACMlegMtt2+85fbnu8nky0Eu//Qbx7fZrcPxjm8zGmNvwMRZtrs4/oC9aL/f4VjYLdHk1+2k2YnPJ3ad54XLQTLjDgAG8/PMHdzjdNwRx/5h8oz7ik93O77Nrh1aOb7NaIydcQeQGszi2IVkB25wOu6a2h79EtLBl5eDBGA3EkE4wZZEy5Zy+h39EhAbTyXudNAAkB62JFq2lBMwlZdyKw9cVMZbiTsdNACkh5cGcwCN81Ru5YHM3VOJe6owQAFAQ7YM5vTfdqP+zEOduIvEPQa2DFAAAHhJqsZfks/E2ZwTeeFykJ5I3A81QBoibEK8Aqlnc5KB1CEuEsfYFZ/6+nqNGzdORUVFCgaD2rZtW4PXX3rpJfXp00f9+vXTuHHjVF9fH3V7nkjcDzVAGiJs6lCIV9jMlrZmSznhLV6NO9v3y43LQS5cuFC1tbWqrKzUyJEjNXXq1Mhr+/fv129/+1v96U9/UkVFhUKhkBYvXhx1e55I3G0IJBvK6AUkw+4j1v2BtgY0zqvtw/b9cuO3qStXrlSPHj0kSZ06ddK6desir2VlZamiokI5OTmSpLq6OmVnZ0fdnicSdxsCyYYyAk4g1r2BEzCYgDg0D318Q5WVlerVq1fkUVlZ2eD1UCikvLy8yPPMzEzV1dVJkjIyMnTiiSdKksrLy7V3715dfvnlUT/P2MQ9nsbKGne7eKGenNgHU7ZhExv3N9kym7rPppYLMAHtw1ApmHIvKirSvHnzIo+ioqIGH5mXl6fq6urI8/r6ejVr1qzB82nTpundd99VWVmZAk2svwmEw+FwU/u5Zs0aPfTQQyovL9eGDRs0fvx4ZWZm6vTTT9ekSZOUkZGhOXPmqKKiQs2aNdNtt92mq666Snv27NGoUaMUCoV04MAB3XvvvercuXNTH6f9dU2+BQAAeFybbncww2uQWOqjRbOoL7tqzWd7HN/mRe1bRn39tdde0+LFizV16lStXr1a06dP1+9///vI62PHjlVWVpbGjh2rjIym59ObTNxnzJih+fPnKycnR3PmzNEvf/lL9e3bV1deeaVGjhypa6+9VhdeeKEGDRqkF154QTU1NSopKdELL7ygJ598Uq1atdLAgQP1ySefaOTIkfrLX/7SZKFI3IGj2TKA2VLOdOKYuIdjDzirqTZlcuK+9rOQ49vs2D4v6uv19fWaMGGCNm3apHA4rMmTJ+tvf/ub9u7dqwsuuEC9e/dW165dIzPtAwYMUM+ePRvdXpOHNz8/X2VlZbr77rslSeedd56+/vprhcNhVVdXq1mzZlq7dq06d+6srKwsZWVlKT8/Xxs3btTAgQOVlZUlSTp48GCTC+4BNC5aR2lScmJKOUySzDExqW6BdLM1/m0tdyxs3q9YrgLjtIyMDP3mN79p8Lczzzwz8v8bN26Mb3tNvaGwsLDBWpxDy2N+/OMf6x//+IcuueQShUIhtWz53VcFubm5CoVCatWqlVq0aKGdO3dq1KhRGjFiRFyFA7zK6fWPNnek8fDjulG/1C1wLKbHf2N9kunlToYf+2GTxP3j1EmTJun555/Xq6++qhtuuEFTp049auF9dXV1JJH/+OOPNXDgQA0fPlzdu3d3ruSAxbzcqacSxw3xImaQSn6ML5v32Y3LQTot7sT9+OOPj1zW5uSTT9bu3bvVsWNHrVy5UjU1NdqzZ4+2bNmigoICbd68WcOGDdPDDz+sK6+80vHCH4mzQADA4RgXgPSgraVH3In7/fffr+HDh+vmm2/WrFmzNHz4cJ100kkKBoMqKSnRz372Mw0fPlzZ2dl6+OGHVVtbq0mTJikYDOq2225LxT5EgsXms0CJoG+KKcfHlHIg
OdSj3WKtP9vHBcSPtu0OK9qaB6bcY7ocZLp58aoyXv6hihs4nkfjmHhLIvVJDCAaN+PDy7Hp5X1LhMlXlVm3w/mrylzQLvpVZZxG4g7Al5oabBmMzRZr/ThZj8QEnGRKPDldDpMT9/U7qpt+U5x+0C7X8W1GQ+IOI5jSgQHwFtv7FtvL71derbdYliabnLj/7X+cT9zP/z6JO4k7ACBlEk2qUpWMeTXJgz+RuKdW3D9OBQAb8YM1fzpWvSeaJKcquSZph2RPH2VLOY/FA79NJXGHfWzuNOCeWNdDm8bEMvkVdYFUSvUJnBPxSxtwH0tlAAAA0uDwZVG2LpFqqtwmL5XZ8LnzS2XOO4WlMgAAOMLJGUJmG5GswxPequXTiak0C6Tgv7TvAzPuAABEZ+vsKJBuJs+4b/x8r+PbPPeU4xzfZjTMuAPwLWa7ECuS9tSiLdrD5roKBJx/pH0fmHEH4HXMlsI0xCS8yuQZ94//1/kZ93O+x4w7ADiKBMm/TJ0dJCZhK1PbVCy4HCQAAD5hc8ICc9kUV9Z/U+SBzD2mxH3NmjUKBoOSpPXr1+umm25SSUmJJk6cqPr6eknSnDlz1KtXL/Xt21eLFy+WJO3Zs0eDBw9W//79NXDgQO3cuTPmgpkUyCaVBd/yY534cZ8Bk1idsBiKfs2uuLKprF7VZOI+Y8YMjR07VjU1NZKk0tJSjR49WrNmzVJeXp4WLFignTt3qry8XBUVFZo5c6YeeeQR1dbWat68eSooKNDzzz+vn/zkJ5o5c2bMBYs3OFLZ+AlU8/ixTmzeZwZnAMdic78WK/o/c3jhcpBNJu75+fkqKyuLPP/iiy/UpUsXSVKXLl20cuVKrV27Vp07d1ZWVpZatmyp/Px8bdy4UQUFBaqu/vZi96FQSM2ape4XC35o/PA3mzt/W9qnzccYMJmf25ab/V8qjruf69IETSbuhYWFDRLu9u3b64MPPpAkLV68WPv27VMoFFLLli0j78nNzVUoFFKbNm307rvvRmbbb7rpphTsAuAPtiS/NuMYozEkK8mhbbnDieN+ZOzbXJdeuBxk3D9OnTx5sp566ikNGTJEJ5xwgtq0aaO8vLzIzLokVVdXq2XLlpo+fboGDx6sV155RTNnztTQoUMdLTzgFjcGcbcTB7c/H3ATd7n0Nz/Xvc2JuhfFnbgvWbJEkydP1tNPP62vv/5al19+uTp27KiVK1eqpqZGe/bs0ZYtW1RQUKBWrVpFZuJPOOGEBsk9YLN0dWSHDxZud55ufz6QCOLWLqYmyMTRd0yto1h44KIyinvR+WmnnaYhQ4YoJydHl1xyia688kpJUjAYVElJicLhsIYPH67s7GwNGzZMY8eO1axZs1RXV6eJEyc6vgOAl7k9WFh/6S/4HjFshljrgboym/XtyY1M22HcOTVNrA92w6TreFJvAIB4NTZ22D6mxFJ+k++cumXnPse3eeZJOY5vMxpuwJQmNjdU06Sz43Oj3mz+GhIwDe0pMW4dN6/Ulylj/rGO55F/i/eY21xHXrgcJDPuMbD9DBkAvIZ+GXBHU23P5Bn3T3bud3ybZ5zUwvFtRsOMewziGRxsPhMF/IS2ml5OH+9Y+2XqGSaxPR5tP2H2wuUgmXEHACAGtictbnL72Ln9+X5i8oz71q+cn3E//URm3CXZf1YKAHCfk2OJnxK/dH1Dkq6xPt11Z1MOE09Z23S7w6p98yJjE3cg3eiMjublY+LlfYP7TIivZMrg5QsASA2PTSrqyqaTvHjKWrV8ulX7dhQPXMjd2MTd6sBASqVqQCTmjublYxLvvqUi7mLZZioTQBOSS68yoe2kswymxVJTM8OHHxsT6spNzLjbhTXuDmDtHAAAaIof8gWT17hv+0eN49s87YRsx7cZjbEz7iZp6uzS643QRLaf8btdfrc/3wR+unaxn5lYb6mYtUzXfpp4PE3U2HFyIl9ItA6cqDvbZ9y5qkyK2Dbjbiubz/zdLrvbn4/0or7h
NcQ0GtNUbNh8Hfe/73J+xj2/LTPuSBObO+1Ul51vWZxl8wyNRH3bzO3Yc/vzG+OHmHbz2Df22abGw+G8HBse+G0qibvf2NBpmMDLHZcbOJ6wmQ1XZ8HR3Dz2jX028YBksVQGAIAYsLwkcW4fO7c/309MXiqzvcr5pTKntjFoqcyBAwc0atQolZSU6KabbtKiRYu0bds2FRcXq6SkROPHj1d9fX3k/bt27dLVV1+tmpqGB2bLli26+OKLj/p7NMwMA4B3eKFPJ/GzV6J1x4/Yj+aHfTRZ1Bn3F154QRs3btSYMWNUVVWlG2+8Ueeee65+/vOf65JLLtG4cePUo0cP9ezZU2+//bYefvhh/f3vf9f777+v7Oxvz0BCoZBGjBihjz76SG+99Vbk79Ew4w4AcAIzrXCDn+PO7Bn3Wse3eWqbLMe3GU3UGfdrrrlGw4YNizzPzMzU+vXr1b17d0nSFVdcoffee+/bDWVk6A9/+INat24deX84HFZpaalGjBihnJycVJQfPsUZPwB8hz7RLF5N2rkcpPuXg4yauOfm5iovL0+hUEi/+tWvdOeddyocDivwfyXNzc3Vnj17JEmXX3652rRp0+DfT58+XVdeeaXOPffcFBUfSC2bOygA3k2gjmTzftLP2sPmOPOKJq8q8/nnn2vAgAG6/vrrdd111ykj47t/Ul1drVatWjX6b+fPn68XXnhBwWBQO3fu1KBBg5wpNZAmdFIAkFqH97Mk8eazeVz0wuUgo65E+uqrrzRo0CCNGzdOl112mSTp/PPP17Jly3TJJZdo6dKluvTSSxv992+88Ubk/3/0ox/pmWeecajY8DubO45Y+Hl9JAD/ot8Doos64/7kk09q9+7devzxxxUMBhUMBnXnnXeqrKxMRUVFOnDggAoLC9NVVsA3GLwAZzCDC+AQL6xx5zrugIGYcU89jrE/2FDPNpQxnTge6RXP8Y7lvSZfVeZ/vzng+Da/d3xzx7cZDYk7AMCzbEgCbSgjECsS99Rq8sepNuCrUCBxtB/AXSTtQJp44Nepnkjc6fSA2B2ZqEdrPyT1/hNrnRMbiAfxAjiDpTLga1oAnkX/BjirqTZl8lKZL3Y7v1Tmn1qxxp3EHTAESQ8AIB4mJ+5f7nE+cT+5JWvcgaPwNas7SNoB/7H9tvaAl5G4wwokkADclkgya2MCXLV8On0uGmVjTB8SSMF/6UbiDgBADBJJZkmAv2NzwgeYwtjEnQYOADAJ41JyOImB67gcZOrQwAGAZDFZTh4/xqXE+S2OG9tf24/DsS5YYNM+eSBv56oyAJAMrrwDmIm26Q6TryrzVcj5BPPEvPTusLEz7n5g01kqgGNLJjGgDwBSh6TdWV642lAg4Pwj3UjcXUSnAvgbfQCQOk4nmfFsLxUJrttJM1cbMkPUpTIHDhzQ6NGjtWPHDtXW1uq2227Tv/7rv0qSJk+erA4dOqi4uFiS/n979x0XxbXFAfy3SFOKgEhsKEVEScQCPmMUeyE2VIooYC+I2EATK7ECFozGGjSIAQUkYsUSkURFY0MROwgKxoJUpclS5v3BZyZgrOEOy+L5vg+fFxY/Z+/Ozs7euffccxEYGIjIyEgAQI8ePeDu7o7Xr19j3rx5yMzMhJqaGlavXg0dHZ0PNopSZQghhNQ0lHohHjq28kOed07Nyi9lHlNHrQ7zmO/z3hH3w4cPQ0tLC3v37sWOHTuwYsUKZGVlYdKkSYiOjhb+3ePHj3H48GGEhoYiLCwMMTExuHfvHkJCQtCqVSvs3bsXw4YNw9atW0V/QeTzIOuRB0KIfKDFqfLhczm29N1Fquq9HXdra2vMmjVL+L1OnTrIz8/HjBkzYGNjIzzeqFEj7Ny5E3Xq1IGCggJKSkqgoqKC2NhYWFlZAQC6d++Ov/76S6SXQWSNLkbio2NMWPsczqnPpUNI5EN1n4+fw2f8U9T6HHc1NTWoq6sjLy8PM2fOxOzZs6Gvr4927dpV+ndKSkrQ0dEB
x3FYvXo1zMzMYGhoiLy8PGhoaAixcnNzxXslRKaq+2L0OX4Zf46vWWyf+5fa53BOif0ef+7nEKnZPofP+Ofmg4tTnz17hjFjxsDGxgZDhgx5578rKirC3LlzkZ+fjx9++AEAoK6ujvz8fABAfn4+NDU1GTWbkM8PdRDYk+d6xKRmoI7Rx6HPlmzQca993ttxz8jIwIQJEzBv3jzY2dm9899xHAc3NzeYmppi+fLlqFOnPFG/Y8eOOHPmDADg7NmzsLCwYNh08rn6HC5Eb3uNsugg1OZjXVOOMSGfA/psycbHHHdZV8upTrUhVea9VWVWrlyJ48ePw8jISHhsx44dUFVVxaZNm6Crq4tRo0bh1KlT8PDwQPv27YV/5+HhgdatW+P7779Heno6lJSU4Ofnh4YNG36wUVRVhhBCCAtUrYQQtuS5qkxOIfuqMlp1q7eqDO2cSgghhBBCPogfcZfXjvvLwjLmMevXrd4tkWgDJkIIIZ+V/zrdL1aagLynH5DPh7xvwlTrU2VkhUbcCSGEEHFQ+hD5Lz72vKnJI+6vXrMfcddUpRH3j0IjFISQqqBryOeB5ftcW84Z6rSLr7acKxXx5408vzaJCD/VTW477nThIbWZPF8Y5QVdQz4P9D4TWaDzjohFbjvuRDaoQ1k96KIvvg+dy3Su1w7y8D7KQxsJqRVqwZA7ddxroJp8EacOJfmQmnz+VvShc5nOdfImsc4JOteIvJCX6/u7SET4X7W/hpq6OJUWzxBCCKlJ6HuJkA+ryYtT84rYd3nVVaq3815jR9zp4kgIIf+dvI+MscLyOND3EvlYtfXzJ++vqzaUg6yxHXdCSGWyvmDK+vk/V//1uFMnsxwdByILtfW8y76ymb4LZKzGpsoQ8j40ZU1I1X0On6PP4TUSUpPU5FSZAin7Lm89ZUqVeSf+Lq+m7XpH/jt6Tz4eHSs26Dj+gzq0n4bOHfnyrverOt5HWT73p/qUNml3cq+Rr+FzUutG3Gl0hRB26PNE5J08nMPy0EZCPlaNHnEvFmHEXYlG3AFQXichNQF9nojYaPSuej9ndLxJVcnzOSSLcpBlZWXw8vLCyJEj4eLigpSUlEp/j46Ohq2tLUaOHIl9+/Z9MF6N7bh/yoVMnk8i8t/U9ve8tr++moCOcc0gTzeHteGced/xltXro/QL2frUVBnyaaKioiCVShEWFgZPT0/4+voKfysuLoaPjw8CAgIQFBSEsLAwpKenvzdeje24f8rJ8a4LEZ1gtZc8fdn/F7X99dUEdIw/P1X9Tqjt58zHvD4x1phlX9ksyrGlPsDH+dRjL8+fA1mUg4yNjYWVlRUAoH379rh165bwt6SkJDRv3hz169eHsrIyLCwscPXq1ffGq5GZSKqKQOH1qp8YLGIQQgiRXxW/B+g7oer+6zGUxbGn95s9eT+mYuTfh4WFISwsTPh95MiRGDlypPB7Xl4e1NXVhd/r1KmDkpISKCoqIi8vDxoaGsLf1NTUkJeX997nq5Edd0IIIYQQQmq6Nzvqb1JXV0d+fr7we1lZGRQVFd/6t/z8/Eod+bepsakyhBBCCCGEyLOOHTvi7NmzAIC4uDi0atVK+JuxsTFSUlKQk5MDqVSKq1evokOHDu+NVyPLQRJCCCGEECLvysrKsHTpUiQkJIDjOHh7e+POnTsoKCjAyJEjER0djS1btoDjONja2sLJyem98ajjTgghhBBCiBygVBlCCCGEEELkAHXcCSGEEEIIkQPUcSeEEEIIIUQOUMedEEIIIYQQOUAddwIACAgIQFZWlmjxr1y5Uunn+vXreP78uWjPJ4a8vDzcv38fBQUFosQvKytDaWkprl69CqlUyiTm06dP3/nDUnFxcaXfU1NTmcavDmK9vwkJCRg9ejSGDBkCf39//PHHH0zjA8DJkydRUlLCPC75eG++r8eOHWP+HI8ePcKZM2fw/PlzyGtdibKyMlk3gRC5ViM2YIqKisJff/2F3NxcaGpqwsLC
AtbW1pB8zF6yH5CVlQV/f3+oqKhg3Lhx0NbWBgBs3rwZ7u7stkNOSUnBiRMnhA7MixcvsHz58irHLS0tRWlpKTw8PPDjjz+C4zhwHIfJkyfj119/rXJ8Xt26deHm5gY9PT3Y2tqie/fuTI4/b8OGDcjIyMCXX36JO3fuQElJCVKpFPb29pg0aVKV42/eXHk3NyUlJTRq1AgDBw6EkpJSleOfOHEC27dvR2lpqXBuurm5VTkub+3atdDX18fTp09x+/Zt6OrqYvXq1VWOO2fOHABATk4O8vPzYWJiggcPHkBXVxcHDhyocnyep6cnNm7cCIlEgtDQUOzatQsnT55kFl+szxdPzPd31apV8PHxweLFi2FnZ4dJkyahV69eTGLzbt68iS1btqBr166ws7ODsbEx0/h5eXnYsWMH0tPT0bNnT5iamqJFixZVjnvlypV3/q1Tp05Vjn/w4MF3/m3YsGFVjg+Ud9ivXbuGyMhIXL9+HUD5dTs6OhoDBw5k8hwAEBwcjFOnTuHly5cYNmwYUlNT4eXlxSx+Wloa1q5di+zsbAwYMACmpqZo164dk9jHjx9HWVkZpFIp1qxZg0mTJmHixIlVjltxt8o3vW9DnI+1detWuLm5wcPD41/fh35+flWOz3v27BmOHj2KoqIi4TGW/ZMRI0Zg6NChGDZsGLS0tJjFJbIh8xH3ZcuW4dy5c/jmm28wYsQIdOnSBRcvXsTixYuZxP/uu+9gaGgIPT09ODs748mTJwCAy5cvM4nP+/777wEA165dw99//42cnBwmcffv3w9ra2ucPXsWAwYMgLW1NQYNGoQmTZowic8bNWoUQkNDMWPGDBw+fBi9evXCpk2b8OrVKybxVVVVcfjwYaxfvx6HDx9GkyZNcOTIEfz+++9M4t+/fx+PHj2Crq4unjx5gr/++gsxMTFYuHAhk/iBgYHYt28ftLS04ObmhqioKCZxebGxsXB0dMT169fxyy+/MJuN4LdibtmyJU6cOCF0qL/44gsm8XldunTBd999B1dXV1y7dg379u1jGl+szxdP7Pe3RYsWkEgk0NHRgZqaGtPYADB37lwcPHgQnTt3xoYNG+Do6IiIiAhmo/ALFy6Evr6+8BlbtGgRk7ghISEICQnB6tWrsXLlShw6dAg+Pj746aefmMRPSkpCUlIS9u/fj2PHjuHZs2f4/fffmY6Gt27dGkZGRlBRUYGhoSEMDQ1hYmKC9evXM3sOAIiMjERgYCA0NDQwbtw43Lhxg2n8JUuWwNbWFlKpFJaWlli1ahWz2AEBAfjmm29w+PBhnDlzhtmsU3p6+jt/WOjduzcAwNHRUdgd80O7ZP4Xs2bNQl5eHnR1dYUflgIDA6GkpARXV1fMmTMHFy5cYBqfVC+Zj7gnJiYiODi40mN9+vSBo6Mjk/hSqVT4kLVp0wZubm4ICgpiPs2oqqqKqVOn4tGjR/Dx8cHo0aOZxHVwcICDgwN+++032NnZMYn5Nq9evUJkZCQOHToEDQ0NLFq0CCUlJXBzc/vX+/NfZGdnQ0VFBQCgrKyM7OxsKCsrM5s2ffXqFXbv3g2g/CI7YcIErF27FqNGjWISX0FBAcrKypBIJJBIJKhbty6TuLyysjLEx8ejWbNmkEqlzNOWnj9/DnV1dQBAvXr18OLFCyZx+ZQeW1tbFBQU4K+//sLKlSuZxK5IrM8XT8z3t379+ggNDUVhYSEiIyOhqanJLDaP4zjExMTg4MGDePLkCYYOHYqsrCy4u7tj+/btVY6fk5MDOzs7HD58GB07dmR2/eQ7t1OmTMHWrVuhqKiI0tJSTJkyhUl8T09PAMDEiRPh7+8vPD5hwgQm8QGgcePGGD58OGxsbACUf5bj4uKYz3rwx5wf+VVWVmYav6ioCF26dMG2bduEGxFW+FhqampQVlautMV7VVQclX7x4gVKSkrAcRyz61vr1q0BAE2aNMHJkydRWFgo/O1///sfk+cAyo8LPzsqBk1NTTg5OeHr
r7/G1q1b4enpiWbNmmH69Ono2bOnaM9LxCHzjntZWRmuXr0KS0tL4bHLly8zSW8Ayqcs79+/D1NTU3Ts2BFTp07FtGnTmOexchyH9PR0FBQUoKCgAC9fvmQav2vXrpg5cyaSkpJgYGCABQsWoFmzZszi29nZYejQofjxxx/RuHFj4fF79+4xid+nTx+MGjUK5ubmuHnzJnr37o29e/fCxMSESfzc3FxkZWVBR0cH2dnZyM3NRXFxMV6/fs0kvqWlJTw9PZGWlgYvLy+0bduWSVyejY0NVqxYAW9vb6xduxZjxoxhGr9bt25wdnbGV199hfj4eKGTUVV8WknFjhz/2OnTp5k8ByD+58vS0hIeHh6ivL/e3t7Yvn07tLW1cevWLaYjmbz+/fvD0tISLi4usLCwEB5PSkpi9hx8rOfPn0NBge1kbcUR0tLSUuY3rllZWXj16hU0NTWRnZ3NfMYGKE+dECPdjTd48GA4OTnh6dOnmDx5Mvr27cssNlB+I3Du3DnhxoPljUGzZs1ga2uLJUuWYPPmzTA3N2cWGyifEYqLi0NhYSFev34NfX19prN+np6esLKyYj4SzjMxMUFkZCTatGkj3JgZGhoyi79nzx4cOnQI6urqsLe3h6+vL0pKSuDg4EAdd3nEyVhKSgrn6urKde/enbOysuJ69OjBubq6cg8fPmQS/+7du5yzszOXnp7OcRzHZWZmcgcOHOD+97//MYnPu3z5Mrdnzx4uKiqK+/rrrzlfX1+m8SdOnMhFRUVxL1++5E6dOsWNGTOGafyysjKm8d7m7t27XGRkJHf//n2O48rfC1bPGx0dzfXp04ezsbHh+vXrx505c4b7+eefueDgYCbxOY7jzpw5w+3YsYM7ffo0s5hv8/TpU1HiJiQkcJGRkdzdu3dFiS8msT9fHPfP+xsdHc0k3rNnzziO47jk5OR//bCWm5vLPGZF9+/f5xwcHDgLCwvO3t6eu3XrFtP4wcHBXP/+/Tl3d3fO2tqai4yMZBrikJz3AAAgAElEQVT/xIkTXN++fblhw4Zxffv25S5fvsw0Psdx3MiRIzmO4zhnZ2eO4zjm12iO47gHDx5wx44dE+Uz/OzZM2727NncwIEDuRkzZnCpqalM4+fl5XEcx3EvXrxgGpfjyo99WVkZt3jxYi4zM1N4D1gR472syNnZudKPi4sL0/jr169/6/t57do1ps9DqofMR9z5qWmO41CnTh2hsgbHaCr29u3b6Ny5M9LS0uDs7AwVFRW8fv2aef5hXl6eMH3fp08f5hUFioqK0KdPHwBA3759ERgYyCRut27dAJRXBSksLETjxo3x/PlzNGjQANHR0UyeAyhffHPu3DkUFRUhOTkZv//+O9PFN7169UKPHj2QlZWFBg0aQCKRoHv37lWO++bi4K+//hplZWUYM2YM08XBv/76K1RVVfHq1StERETAysoKCxYsYBb/2bNn+PPPP4XjHxUVxeT4L1++HF5eXhg5cuS/Fm+FhoZWOT7v5cuXcHR0hIKCgvA5YIlfxKirq4uXL1/i4MGDVV68uGvXLixYsOBfCwglEgnTcwcoz2ENDg6GouI/l/SYmBhm8Vu1avXehYBV5eTkBBsbGyQnJ6NZs2bQ0dFhGn/AgAEYMGAAMjMzoampyWxGtyKx0t38/Pz+9dm6e/cujh07Bg8PDybPAQCNGjXCunXrwHEc4uLimK6DSUxMxA8//IDc3FwMGTIEJiYmTBdoq6mpQSKRoKCgADo6Ov+qcvVfPXz4EED5deHIkSP48ssvRRkRDwoKQm5uLp48eQJ9fX1m62D476+kpCQ0atQIUqm0UnGLDh06MHkeUr1k3nFftGgR5s6dW2nqLC4uDgsWLGDyxb93714EBQVh2rRp2LZtGwwNDZGWlgY3Nzd07dq1yvHfVlGgrKwMp0+fZlpRoGLKz/3795nF5b/c586dC09PTzRu3BhpaWnw8fFh9hxA+eKbLl26VErDYen8+fMIDAys
tCqfRedo//792L59OzIyMmBtbS3cYFZMR2AhMjISQUFBmDRpEiIjIzF27Fim8cU6/nzlFW9vb6iqqjKNXdGFCxewceNG9O7dG3Z2dtDX12can08D4TgOd+/ehZaWVpU77vyNV1BQUJXb9yF//PEH/vzzT9Heg4MHD8Lf37/S54tlKpTYHbsrV65g2bJlQtWgJk2awN7enll8QLx0NyMjIyZxPkSsylYAsHLlSlErK3355Zf45ZdfoKenhzlz5qC0tJRJ3Io33RVTb1jffJ88eRLbtm1jXtXqbd9fCgoKlVKTiRyS4Wg/x3H/TC9+7OP/Nf60adO44uJi4XFbW1sm8Z8+fcpFRERw1tbWXEREBBcREcEdOHCAu3PnDpP4vNu3b3MjRozgunXrxtna2jKP/+bxdnBwYBp/3LhxTOO9adCgQdz58+e5pKQk4Yel8PBwpvHeNHLkSO7vv//m5syZw3Ecxw0dOpRpfLGPv6Ojo6jxOY7jioqKuGPHjnGTJk3ixo4dK9rzlJWVcZMnT2YW78CBA9y3337L9e7dW/hhbfLkyZWub6wNHDiQe/ToEVdUVCT8sDRmzBju0aNHnLOzM5eZmckNHz6cafzRo0dz2dnZnLOzM/f69Wvm8XmvXr3i7t27x+Xn5zOPXVxczF27do27fPkyd+nSJe7IkSNM44uZ6sPH4lNAWKeyJCUlcbm5uVxxcTF3+vRpITWWldevX3O3b9/mOI7jTp06xUmlUqbxR44cyRUVFXHOzs5cWVkZ8/NT7O8vUr1kPuJuamqKBQsWwMrKChoaGsjPz8eZM2dgamrKJH7v3r0xbdo0tGrVClOnToWVlRXOnTuHr7/+mkn86qooYGZmhv379zONWZGxsTHmzZsHc3NzxMXFMR9RFnvxTePGjfHNN98wi/emTp064eeffxatjnjnzp3h7OwMPz8/eHt7o3///sxiA+If/3r16sHb2xuGhobCwkXWJdPi4+MRExODzMxMDBgwgGnsihtepaen4++//2YWe8eOHdi2bZsos018femMjAwMHz4cJiYmwvvLss60vr4+k7rt7yNmyUwFBQVoaWlBIpFARUVFlJKcYo2a8tzd3VFcXIwXL16gtLQUenp6GDx4MLP4Yla2Eruy0qJFixASEgLgnxKOLM2bNw9dunSBmZkZHj58iOPHjzP9fIlV1So8PBz29vZISUn5V3owyzQrUr1k3nFfunQpoqKiEBsbi7y8PKirq6NXr17o168fk/hTpkzB5cuXERMTgyZNmiAzMxMuLi7MV1KLVVGAz0F/G5Y5rCtWrMDZs2eRmJiIgQMHMs8jvnv3Lu7evSv8znqqsUGDBvDy8oKZmZnQcWHZcfz+++/Rq1cvXLt2DXp6esyrEs2ZMwdz5szBy5cvMXfuXOal3sQ+/nyuZGZmJrOYFQ0cOBCtW7eGvb29KFVZrK2thf9WVVVlsjkMT8xOL6uyuR+iqqqKSZMmVbrxY/nFL3bHrnnz5vDz80NOTg78/f2Z74MBlK9p2LdvHyZOnAg3NzfY2toy7bjn5eUhODgYixYtwpIlSzB+/HhmsQFxK1uJXVlJ7IGDtLQ0obTw5MmT4eLiwiw2IF5Vq0aNGgGovnQrUj1k3nGXSCTo168fs4762/zvf/9jWnP1bWJjYzFv3jy4uLggKCiIWY4yy875+/Bl9ho2bIjc3Fwmi/MqEjvPly+NmZGRIUp8seuIi52D++bxrzjCzIK7uzsuXLiAv//+G+bm5kxH84HycmYSiQSpqalC2U+WWC7EfpOYnV4LC4t/7axcVlaGKVOmML0x69GjB7NYbyN2x27ZsmUIDw+HhYUF6tatK8peA2Lv9cAvPC4sLISqqiqzBZg8JycnODk5AQCzDbaeP3+ORo0aIT09Hba2tsLj2dnZTHfwFHvgAChfqGpoaIjU1FRm+4/wPDw8cPbsWZiZmcHY2JhZ/r9EIkFMTAwaNmzIJB6pGWTeca8txJpmfNtWyzyWU3Vu
bm7Q09MTpvPf9ZyfaubMmfjpp5/eOnPA4qaE/2IYNGhQlWO9DydyHfENGzYgODgYM2bMgKurK0aNGsW04x4aGopdu3YJG5QoKSnh5MmTzOKvX78ez58/R1JSEpSUlODv78+0ctPFixexYcMGGBsbIzExEe7u7sxq0QPlVUcq7jKqqKiIxo0bY968efjyyy+rFFvMTm91LZ5muWfE26xZswb9+/fHnDlzUKdOHebxvb29Ky00/O6777BmzRqmzyH2Xg/9+vXD5s2b0bp1azg4OAgbqlUVf43u0qXLv459Va/RFSsrVdzvgdWMX3Vd/xcuXIjZs2cjMzMTenp6WLZsGZO4fDUrHsuqVkB50YN3ed9sPqnZqOPOiFjTjNU1Fc5xHNatW8c8Lr91eXh4eKUcX1Ybw1RXyT13d3dERUVh6NCh6NOnD9PZCED8HNx9+/YhKCgI27Ztg7W1tbDLLCuxsbHYs2cPXFxcMHz4cCHflJXAwEBERERATU0NeXl5GDt2LNOOe+fOnWFtbQ1LS0tcv34d4eHhsLW1xcqVK6v8WoYNG4abN29WqsjCSnXtrMwfA47j8ODBAzRt2hSdOnViFt/GxgbR0dHYvHkzWrRogf79+zNJ19uzZw+2bduGnJwc/P7778LjrNcgAeUpFNevX0ebNm1gZGTEPNeaHw0Hym8GDQwMmMTlr9GGhobYu3cvk5g8vrLS+PHjKx0PVuWS37wxAMrPUdbX/zZt2sDHxwdmZmaIiooSdlStKv57MC4uDnXr1kWHDh1w8+ZNlJSUMPmOeVd1OFY7yxLZoI47I2JMMwL/bKuck5ODmJiYSls6s0z/MTU1xY0bN9CmTRvhMRZ51gkJCUhLS8O6devw3XffCVP5fn5+OHToUJXjV1fJvXbt2kFdXR1t2rQBx3HMR1HFzsHV1taGnp4e8vPz0blzZ+HLmpXS0lIUFRVBIpGgtLSU+c6aEolEuJlRV1dnuh07UD4Nzi9u7ty5M7Zu3YouXbpg8+bNVY49Y8YMZGZmVprNYtnpBcRfPF1x9kQqlWL27NnMYgPlKT8GBgZo3bo19uzZg2XLljHpuPPX5e3bt8PV1ZVBS99typQpCAkJYbJ/xNvcvXsXYWFhlW4AWZbtVVBQwPTp0yvliVc1patiueS4uDgAbMslV7z+Z2VlITU1FQYGBkzTcIDycsliLE719PQEAEycOBH+/v7C4xMmTKhy7Ip++ukn7N27V9hN3MDA4L2j8aRmo457FYmdClLxeQwMDJCQkAAVFRXm+ZOXL1+ulOfLasv6V69e4dixY8jMzMTRo0eF2KxzxMWuM81fuNu0aSNKVYGKObj16tXDihUrmMUGAA0NDURFRUEikSA0NJT5lvJjx47FiBEjkJWVBXt7e4wbN45p/ObNm8PX1xeWlpa4evUqmjdvzjS+srIyQkJC0KFDB1y/fh3Kysq4desWk3rQGRkZTDejehuxF09XVFpaisePHzONaWNjAwUFBQwZMgTLly9Hq1atmMZ3dHTE0aNHKw18TJ06lelz1K9fH7t3767U8WWZjjB//nw4OzsLCw5Zq5iDzkrr1q2Rk5MDFRUVYd2LRCJhntqyd+9e7N69Gy1btsSDBw/g5ubGdEZO7MWpWVlZePXqFTQ1NZGdnY2cnBym8c+ePYuzZ8/C29sb48ePZ5bqQ2SDOu5VxI9cHj58mPmCuTctX74cCxYswKpVqypNm7Jw+PBhpvF4lpaWsLS0xObNm5nulPomMUvuAeJfuOvUqYMvv/wSLVu2BADcuHGD6ajsypUr8fjxY3h6eiIgIABLly5lFhsAvv32W3zzzTdISUkRZedLb29vhIWF4cKFCzA2NhZGqlhZt24dtm/fjujoaJiYmGDNmjWIj49nskiS3/SN5U6UbxJ78XTFDmhJSQnTiiNA+WcqJiYGZ86cQVpaGrp16wYrKytm8cUe+ADKZ7Xu3buHe/fuCY+x7Ljr6uoy3zSqouHDhzOPWbFcMutZ
uIr27duHw4cPQ0VFBYWFhXB2dmbacQfEXZzq6uoKW1tbqKurIy8vD97e3kzja2lpQVlZGfn5+WjRogUKCwuZxifVizrujEycOBHNmzeHg4MDkx1Z36aoqAiFhYXC1s4snT59WphK4zgOOTk5OHLkCLP4Fy9eFLXjXh11psW8cLu7uyM7OxuNGzcWcjRZdtwVFRVx6dIlPHz4ECYmJujYsSOTuPxU9duwmMa/cuWK8N+tWrUSRmLj4uKYHh9tbW106dIFurq6MDQ0hLa2NrN0qGvXrqFXr16VbmZYV4viF0/n5+eLsng6JiYGBQUFqFevnig3IYMHD0b//v1x6dIl+Pv749ixYzh37hzT5xBz4AMoP98TEhLw4MEDGBoaVko7ZKFp06bw9/evVJ1IXhYY7tixAzt27Ki0sy/Lz0CDBg2EhbWqqqrMU2XEWpzKGzBgAAYMGIDMzExoaWkxX6DdqFEj/Pbbb6hbty78/PyQl5fHND6pXtRxZ+TAgQO4efMmIiIi4Ofnh379+mHatGnM4js5OWH37t1o27YtevbsyazjxduyZQuWLFmC0NBQdO7cGefPn2caXyqVYtiwYZWmkVmmmohdZ1rsC3dmZqao6RQeHh4wMjKClZUVrl27hgULFjBZjMznqfJpJh07dsTNmzdx8+bNKsfm4wJAamoqiouL0bZtW9y5cwdqampM1zX4+fkhJSUFHTt2xMGDB3H16lXMnz+fSWyW1Xvexd3dHadOnYKNjY0oi6c3b96MvLw8zJ8/H6tWrcJXX32FKVOmMIvv6uqKp0+folu3bpgzZw7z6xsg7sAHUJ5nffToUZibmyMgIADffvst0/0AiouL8fDhQzx8+FB4TF467vyNmBgzHUD5jeuwYcPQoUMH3L17F8XFxcKsHIvvmXbt2jFZk/UmFxeXd1ZwY7m4dvny5Xj27Bmsra1x4MABbNiwgVlsUv2o486QiYkJ2rdvj9TUVFy9epVpbFVVVYSFhUFDQwOKiorMd6XU1tZGhw4dEBoaihEjRiAiIoJp/Llz5zKN9yax60yLdeHmiZ1OkZOTI7wHffv2ZZZKwacz7Nq1C5MnTwZQvtCQ1eYw/KLIKVOmYOvWrVBUVERpaSnTTiNQPrLP3ziNHTsWDg4OzGLHxcUhIiKi0sLRX375hVl8oHxxqrGxMR4/fozjx48zH3GMjo4Wrgk//fQTHB0dmb4Hs2fPxhdffIHHjx+LUnqSH/jo2rUrevTowbxcJgAcPXoUe/bsgaKiIoqLi+Ho6Mi04y72iL6YmjZtWmm0nbXhw4fj1atXqFOnDi5cuAAXFxeYmZkxi9+7d+9KHWx1dXUm3wf8ANCWLVvQp08fWFhYID4+Hn/88UeVYwPl61He3OfB3t6e+T4PpHpRx52RBQsW4MaNGxgwYACWLVvG/Mtn8+bNCA8Ph46ODtLT0zF9+nTs27ePWXwlJSVcuXIFJSUlOHfuHNLT05nFBsrTHMSoinPz5k20bdtWtA0mqmvxsdjpFC1btkRsbCwsLCxw//59NGnSREiLYlE9qKCgAH/99Rfatm2L69evM98cpuL5WFpaynxxbUlJCcrKyqCgoCCkKrGycuVKjBs3DidPnkSrVq2Yb34FlJc93L17N0xMTERZnCeRSCCVSqGsrCycNywlJydj1qxZotXpb9KkCQYMGACgfD3GnTt3mMXmcRwnbJKkpKQEJSUlpvHFHtEXU3FxMYYMGYJWrVoJny2WM64RERGYOnUq9u7dCw8PD4SGhjJdIH/ixAkA5e/xrVu3hN+rit/RNCMjQ5i97NevH7PZxOra54FUL+q4M9KvXz+sWrVKtAU4ampqQqeuYcOGzKccly1bhuTkZEybNg0bN27EzJkzmcYXa3EY31l8W2krFtPI/OJjsXewFTudIjY2FjExMVBSUhI61QMGDGBWPWjVqlXYuHEjVq5cCSMjI/z4449VjlmRnZ0dBg0ahFatWuHBgweYMWMG0/gDBw7EqFGj
0K5dO8THxzMpVcfT1NTE4MGDcf78ecyYMQPOzs7MYvPCw8Nx5MgR0RbnOTo6Ch2v5ORkYXaFld27d4tSp//q1at48OABAgMDhVmgsrIy7NmzR6hyxYqFhQVmzpwJCwsLxMbGCrt5siL2iL6YWJ8vbyopKUGnTp2wfft2DBo0iHk9+oqDGxYWFkw3l+OFh4fD3Nwc169fZ/b9WF37PJDqRR13RnR0dLB06VLm0+H8BaK0tBRTp04VptJYjJJWtGbNGmEEZNOmTUxj88RYHMZP17+5EJLVBhPVtXOt2DWa9+3bV2lTp2fPnjGtwGNsbIw5c+YgNTUVpqam0NXVZRYbKE91sLGxQXJysihVayZMmIBu3bohOTkZdnZ2TMsRSiQSJCYmorCwEMnJycxnswDxF+fZ29ujW7duePHiBRo2bMh8nwGx6vRramoiIyMDUqlUOO4SiQTz5s1jEr+i77//Hn/++SeSk5Nha2vLPH1P7BF9MZmZmWHHjh1IT09Hz549YWpqyjR+cXExfHx8YGlpiYsXLzIp41qRn5+f8D3w4sUL5gN069atQ0BAAE6dOsV04CM8PBz29vZITU39180GyzVgpHpRx50RsabD+dq3/P8DYLIxyZukUinu3bsHQ0ND4QLF+uZAzMVhYm0wUV0714pdo3n06NFYs2YNTE1NcfLkSWzYsAHHjx9nFj84OBinTp3Cy5cvMXz4cKSkpPxrN9uqeFv1GhY3NhW/kHl8GgWrL7b58+cjMTERLi4umDt3rlBWlKWKi/Pu3LmDkpISpovzKi5OnTlzJvPFqWLV6ecrEdnb2wvrR1jftPLy8vJw6dIlPHjwAM+fP0e7du2Y3kCJPaIvpoULF6J79+64cuUKdHV1sWjRIgQHBzOL7+vri/Pnz8Pe3h5RUVFYu3Yts9jAPyktQHltepalSoHyWfTOnTtDR0cHhoaGqFevHpO4/PfJ3r17MWfOHKiqqopalpNUD+q4MyLWdLgYtXXf5tGjR3Bzc0N2dja0tbWZpVDwnJycEBgYKNriMLE2mODz8PPy8kQdMRK7RrOfnx8WLVqEBg0aQFFREXv27GEaPzIyEnv37sWYMWMwduxY5pu58KkrHMfhzp07zGZUKn4hA+Wjsazzt01MTKCkpISUlBRs2bJFlJszfldQiUSCIUOGMI8v9uLUlStXIjw8XLQ6/SdPnoSqqipevXqFiIgIWFlZvbeU6X+xcOFCdOrUCUOHDsXly5cxf/58bN++nVl8fkQ/KSlJlBF9MeXk5MDOzg6HDx9Gx44dmX/GDAwMYGBgAABM09x4AwcOxL59+/Do0SOYmJgw61jzxKpqxd9gBAQECHH79esnymZbpPpQx52R6pgOF9PcuXOxfPlytGjRAgUFBUy3SwcgLAwDyheHqaurM40v9gYTYo8YiV2jmf+ilEqlUFJSYl4nmI8v1mxNxRGu7t27M9sSnL8xLikpQVhYGB48eAADAwOmo+Jiz0YA5akIW7ZsQVJSEgwMDODm5sZ0tFfsxamurq4ICAhgGrOiyMhIBAUFYdKkSYiMjMTYsWOZP0d2drawMVubNm2YrVv5448/0KtXL4SFhQEoTyV6/vw5wsLCmFcXE1NSUhIA4Pnz53I36jt//nw0bdoUXbp0QWxsLBYuXIjVq1cziy9mVSsAMDc3h7m5OV6+fImlS5eif//+uHXrFtPnINWHOu6MVMd0uJjErloTHh6OwMBAvH79WniM5Yi+2BtMiD1iJHaN5tmzZ8PX1xfNmzfHhQsXMHr0aCapRLzBgwfDyckJT58+xeTJk9G3b19msYHKi4PT09ORkZHBNL6Xlxc0NTXRtWtXXL58GYsXL8aaNWuYxBZ7NgIQf7R31KhRoi5O1dDQwOnTp2FgYCB06iqmB1aVRCJBeno6dHV1IZFImG9QBZSnAqanp6Nhw4ZIT09ntklbTk4OAMjdYFBFixcvxsKFC5GUlISZM2fihx9+kHWTPklGRoaQd963b1/m
C8zFrGoFlC/SjoiIwM2bN2FtbY3vv/+eaXxSvajjzkhcXJyQ6hARESF3NVLFrloTEhICf39/0co2VscGE2KOGPn4+ODhw4fC4k49PT2m8T09PbFgwQKUlpbC2tqa+XTyN998gy5duiAhIQGGhoZo3bo10/gVbzJUVFSYbwmekpIipA/17duX6doGsWcjAPFGe3na2trCTFbDhg0RGRnJdJOnrKwsBAYGCr9LJBKm19DOnTvD2dkZfn5+8Pb2Rv/+/ZnF5s2ePRujRo0SKjetWLGCSVx+VkhBQQFubm7C4ywXx4stNTUVISEhcjfSzq9Va9asGeLj42Fubo579+4JaTmsiFnVCiiv2mRvb49Vq1Yxvykg1Y867lV09OhRREdH49KlS7h48SKA8nJjCQkJGDNmjIxb92HVVbVGW1sbTZs2ZRoTqL4NJhYvXoxFixYJI0ZLly5lFhsQP53il19+QXBwMGbMmAFXV1eMGjUK06dPZxZ/0aJFCAkJgbGxMbOYFfn4+KC0tBQcxyEuLo5p1Rfgn4XTdevWxevXr5lWpRg8eDCcnZ2F2Yh+/foxi82rONqbkZHBbLSXt2bNGqxYsQKamppM4/J69OiBSZMmiRIbKK96xG9q89VXX4ly85SXl4eysjLUqVMHUqmU2TkUHh6O3377DUlJSTh79iyA8u+YiruD1nQXLlzAxo0b0bt3b9jZ2UFfX1/WTfoo1tbWwrqXS5cuCTdlrKoe8Xr27ClaVStAvEpxRDao415FVlZW0NPTQ05OjpBvqKCgIDcXJrGr1vA3BlKpFBMnToSZmZlwx8+iaofYG0xU3DGP4zjo6OggIyMDnp6eTKuyiJ1OoaCgAC0tLUgkEqioqFQqDVkVubm50NDQQL169eDt7Q1DQ0NhVI1l/u3atWuhr6+Pp0+f4vbt29DV1WWaYzpmzBjY2NgIGxix2MeAL8WWlpaGhg0bIi0tDSoqKsjJycGmTZvQtWtXdOzYkUHrgVmzZsHR0RHq6urIz89nNtrLMzExYbJh2rucPXsW48ePZ772grdv3z4MHToUgDgzHgCwdetWhIeHo0GDBsjIyICrqyuTdDcbGxt06dIFP//8s7AIWUFBAQ0aNKhy7Ori5eUFqVSK06dPY/ny5SguLq40w1JTRUdHV8vz8AMfrDvspHaijnsVZWVloWHDhliyZEmlx1mXOxSL2FVr+BsCvjazpqYm1q9fz2xxodgbTJw4cQIcx2HZsmVwdHSEubk57ty5w3yDD7HTKZo3bw4/Pz/k5OTA39+fWR1uV1dX7NmzB02bNoWmpiYyMzOZxH1TbGws5s2bBxcXFwQFBTFfXDh06FB0794djx8/RrNmzaCtrV3lmHz1GCMjIxgZGVWqAlJSUoIffvgBR44cqfLzAEDXrl1x+vRpZGVlMa9xD5TfzI8cObJSFR6W+wxkZ2fDysoKzZo1g0QigUQiERbrsSCVSjFs2LBKN5asU020tLSEzrSuri6zBfjKyspo1qwZvLy8cOvWLWH36djYWAwePJjJc1SH+Ph4xMTEIDMzs1KxAnlw+vRpodwwx3HIyclh9tkFIPrAB6ldqONeRV5eXm/NGZNKpUy/eOQVf2NgZ2cHX19ftGzZEpaWlpg/f76wkyELXbt2xY4dOyptYOTu7l7luHwH+vHjxzA3NwdQXsGj4iJSFsRe3Lls2TKEh4fDwsICdevWZTYiq6qqCltbW6SkpFRKk5FIJEyOP6+srAzx8fFo1qwZpFIpsrKymMUGgMTERPzwww/Izc3FkCFDYGJigl69elUpJl8J5103xyzWMYwcOfKdOassrz98RRYNDQ1mMStiuZD2bebOnStqfKB8cGLixIno1KkTbt++jdevXwszjixmF2fMmIHi4mK8ePECpaWl0NPTk5uO+8CBA9G6dWshz1rebNmyBUuWLEFoaCg6d+6M8+fPM41/4Swdf5MAAA8BSURBVMIFdOjQQRj4qPg9RsibqONeRUFBQQDK
F18GBgYKO6fyO9yRcoqKimjZsiUAQF9fn/kipVmzZqFLly6ibKwClFe92LBhA8zNzREXF8c8X9/Z2bnS4k7WueKKioqiVDrasWMHXrx4AS8vL1ErRdjY2GDFihXw9vbG2rVrma8fWblyJXx8fLB48WLY2dlh0qRJVe64fwiLOtxibL3+Nrq6uqLUx+aVlJTgxIkTlXaeZlmSVuydO4HKKYb8Zk8s5eXlITg4GIsWLcKSJUuYDnyIbc+ePVBSUsKTJ09QUFDAvA662LS1tdGhQweEhoZixIgRwp4GVcWvX6hXrx7OnTsHoHyQouIGaoS8iXqXjISHhyMoKAjbtm2DtbW13FWVEVuTJk2wfv16tG/fHvHx8cyrpqipqWHOnDlMY1a0bt06HDhwAGfPnoWRkRFmzZrFNP7OnTsxadIkGBsb4/79+3BwcMCBAweYPocYFBQU0KhRI/j7+4v6PE5OTvj222/x9OlTzJgxQ5RFki1atIBEIoGOjg6zNQBi428gU1JSRO34qqqqirJGhff999+jV69euHbtGvT09JinGoq9DwMgftohn/9fWFgIVVVV4b2WB5cvX8a2bduEqlYSiaRShZyaTklJCVeuXEFJSQnOnTvHrDRnbVi/QKofddwZ0dbWhp6eHvLz89G5c2f89NNPsm5SjeLj44OQkBCcOXMGxsbGzC/aJiYmiIyMrLSBEcs60PXq1YOTkxOzeG9KSEhASEgICgoKcPDgQWY7v9YWv/32G3bu3AljY2MkJydjxowZTEeA69evj9DQUBQWFiIyMlK06iliEbvjK/bsg6qqKqZOnYpHjx7Bx8cHo0ePZhpf7H0YqkOfPn2wefNmtG7dGg4ODsw3sRPTrl27sG/fPkycOBFubm6wtbWVq467ubk5SkpKMG3aNGzcuJHZjDG/foH1YnJSu1HHnRENDQ1ERUUJi6pY5+DKOxUVFYwbN060+Hfv3sXdu3eF31nXgRabr68v5s6di6ysLOzfv1+0yhfyKjQ0FIcOHYKKigoKCgowduxYph13b29vbN++Hdra2rh165bc5eGK3fEVezSZ4zikp6cjPz8fBQUFomyQJM87dwLAyZMnhb0GevTowbyWuJgUFBSgrKwsLDxmvU+IWCqW4uRTPUtLS6GqqirjlpHPGXXcGVm5ciVSU1Ph6emJgIAA5nW+yfsFBQUhOztbqAoiRmUNMVRcXFhcXIz79+8L+du0uPkfWlpawroRVVVVZiPiT58+Ff67Yme3oKAAWlpaTJ6jOvAd34KCAtE6vmJyd3dHVFQUbGxs0LdvX9jY2DCNL+87dwLlgxHTp0+vVHmEZbqSmCwtLeHp6Ym0tDR4eXmhbdu2sm7SR6FUFlITSTh5nDMk5A3Hjx/Hhg0bYGxsjMTERLi7uzP/8hfDkydP3vk3MTaskjceHh6QSCR4+PAhSktL0a5dO9y5cweqqqpMcpT5kms5OTnIz89Hq1atkJiYCF1dXblYY8C7cuUKEhMT8cUXX2Dx4sUYNmyYXG1rfuDAAfj7+wvVNCQSCU6fPs0sflRUFHr37i2XI+28t52PYs+EsJKWliZsMBcREYFNmzbBzMxM1s0iRC5Rx53UCiNHjkRAQADU1NSQl5eHsWPHYv/+/bJu1kd79uwZjh49yrycpby7fPnyO//GckOg6dOnY/Xq1VBXV0dBQQE8PDxEL1FI/jFo0CBs3bq1UlUoluliy5cvx5UrV+Ru587aYvz48Zg6dSr27t2LAQMGIDQ0VKjIRgj5NJQqQ2oFiUQiVALhN3uSJ2KXs5RXfOc8JycHMTExwuYzL168YNpxf/bsmbDYr27dunjx4gWz2NVh8+bNCA4OrlSGNiYmRoYt+jT6+vpo0aKFaPHldefO2qKkpASdOnXCzz//jEGDBjHfwI6Qzwl13Emt0Lx5c/j6+sLS0hKxsbFo3ry5rJv0ScQuZynvZs6cCQMDAyQkJEBFRYX54jYrKys4Ozvjq6++Qnx8PIYNG8Y0
vtj++OMP/Pnnn3K7aE5VVRWTJk2qVBWKdf62PO/cKe+Ki4vh4+MDCwsLXLx4EaWlpbJuEiFyS34T/gipwMHBAfXr18eFCxcQEREhaulGMfDlLJOTk/Hw4UPmO7PWBsuXL4ehoSF27drFbPFleHg4gPJNT3R0dHDq1CnUq1cPGRkZTOJXlwYNGsj1pm89evTAoEGDYGRkBENDQ6alXIHynTu3bNkCQ0ND7Ny5E1OnTmUan7yfr68vDA0NMWXKFGRlZWHt2rWybhIhckt+r/SEVODr6wtfX1+0bNkS48ePx/z584XSafJA3stZVoeioiIUFBRAIpEwq1PeqFEjAICRkRGMjIxEr1fOGr94NyMjA8OHD4eJiQmA8vPHz89Pxq37eGIvsnR3d8fGjRtx7do1hIWFyc3i9drCwMBAKF8p5g68hHwOqONOagVFRUWhzq6+vr7cVY+ghVrv5+TkhN27d6Nbt27o2bMnOnbsyCSulZUVAPmpzvEmR0dHWTdBLuzevRsRERGVFq9Tx50QIo+o405qhSZNmmD9+vVo37494uPjoaenJ+smfZLevXsLub1A+YZeBw8elGGLapaXL1/i0KFDKCwsRGFhIW7cuCHrJtUI/ALdzMxMbNu2DY8ePYKJiYlQc5qUk/fF64QQwqNykKRWKCoqQkhICB4+fAhjY2M4OjrK1e6jUqkUQPlGOrdu3cKJEyewaNEiGbeq5hgxYgQ2bdqEhg0bCo/J0/srNhcXFwwcOBAdOnRAbGwszp49i59//lnWzaoxvvvuO+jo6MDS0hJXr15FTk4OfH19Zd0sQgj5ZDTiTmoFFRUVjBs3TtbN+M8qdkItLCywfv16Gbam5tHW1qYNqT5g1KhRAIDWrVvjxIkTMm5NzeLt7Y2wsDBcuHABxsbG8PT0lHWTCCHkP6GOOyE1gJ+fn5Aqk56eLnc5+mLhb2CkUikmTpwIMzMz0coFyjMjIyMcPnwYnTt3xu3bt6GlpSVUJmJdoUUeKSoqyl2lKUIIeRtKlSGkBvjtt99Qp04dAOWzB1ZWVtDQ0JBxq2Tvbdu88+R1QakYXFxc3vo4VScihJDahTruhNQAEyZMQEBAgKybQeRYbm4unjx5An19fWEhJiGEkNqFUmUIqQE0NDQQFRUFQ0NDIU2GUhzIxzp58iS2bduG0tJSWFtbQyKRwM3NTdbNIoQQwhiNuBNSA7i4uFQqBwmAUhzIR3N0dMSvv/6KiRMn4tdff4WtrS0iIiJk3SxCCCGM0Yg7ITLE12/n75+VlJRQXFxMdabJJ1FQUICysjIkEgkkEgnq1q0r6yYRQggRAXXcCZGhEydOgOM4LFu2DI6OjjA3N8edO3cQEhIi66YROWJpaQkPDw+kpaXBy8sLbdu2lXWTCCGEiIA67oTIEF+//fHjxzA3NwcAmJmZITk5WZbNInLGw8MDZ8+ehZmZGYyNjdGrVy9ZN4kQQogIqONOSA2goaGBDRs2wNzcHHFxcbTZEPkk0dHRuHnzJmbNmoWJEydCSUkJ3bp1k3WzCCGEMEaLUwmpAQoKCnDgwAEkJibCyMgITk5OQl13Qj5k+PDh2LlzJxo0aIDc3FxMnjwZoaGhsm4WIYQQxmjEnZAaoF69erSzI/nPFBUV0aBBAwDlsze08y4hhNRO1HEnhBA5Z25uDk9PT7Rv3x7x8fEwMzOTdZMIIYSIgFJlCCFEznEch9OnTyM5ORktW7ZE7969Zd0kQgghIqD5VEIIkXP5+fmQSqXQ09PDq1evcPDgQVk3iRBCiAgoVYYQQuScm5sb9PT00LhxYwD41y68hBBCagfquBNCiJzjOA7r1q2TdTMIIYSIjFJlCCFEzpmamuLGjRuQSqXCDyGEkNqHFqcSQoicGzp0KPLy8oTfJRIJTp8+LcMWEUIIEQN13AkhpJbIyclB/fr1KcedEEJqKcpxJ4QQOXflyhUsW7YMpaWlsLa2RpMmTWBvby/rZhFCCGGMctwJIUTObdiwAcHB
wdDV1YWrqytCQkJk3SRCCCEioI47IYTIOQUFBWhpaUEikUBFRQVqamqybhIhhBARUMedEELkXPPmzeHn54fs7Gz4+/ujSZMmsm4SIYQQEdDiVEIIkXMlJSUIDw9HQkICjI2N4eDgAGVlZVk3ixBCCGM04k4IIXJOIpGgrKwMHMehtLRU1s0hhBAiEuq4E0KInFuyZAkeP36Mbt264cmTJ1i8eLGsm0QIIUQEVA6SEELkXEpKCvbs2QMA6Nu3LxwdHWXcIkIIIWKgEXdCCJFzRUVFKCwsBAC8fv2a0mUIIaSWohF3QgiRc2PGjIGNjQ1MTEzw4MEDzJw5U9ZNIoQQIgKqKkMIIbVATk4OHj9+jGbNmkFBQQH169eXdZMIIYQwRqkyhBAi51asWAEtLS20bdsWt27dgoODg6ybRAghRASUKkMIIXJOXV0d69atQ0FBARITE7Fz505ZN4kQQogIKFWGEEJqgdWrVyMhIQG//PKLrJtCCCFEJNRxJ4QQOdWtW7dKv2dkZEBXVxcAEBMTI4smEUIIERF13AkhhBBCCJEDlONOCCFybsGCBf96zMfHRwYtIYQQIibquBNCiJwbOHAgAIDjONy5cwcvXryQcYsIIYSIgVJlCCGklpkwYQICAgJk3QxCCCGM0Yg7IYTIuYoLUdPT05GRkSHD1hBCCBELddwJIUTORUZGCv+trKwMb29vGbaGEEKIWChVhhBCaoGHDx8iNTUVpqam+OKLLyCRSGTdJEIIIYzRiDshhMi54OBgnDp1Ci9fvsTw4cORkpICLy8vWTeLEEIIYwqybgAhhJCqiYyMRGBgIDQ0NDB27FjcuHFD1k0ihBAiAuq4E0KInOMzHvn0GGVlZVk2hxBCiEgoVYYQQuTc4MGD4ezsjKdPn2Ly5Mno16+frJtECCFEBNRxJ4QQOeXn5yeMsjds2BBpaWlQUVFBTk6OjFtGCCFEDNRxJ4QQOWVkZCT8t6GhIXr06CHD1hBCCBEblYMkhBBCCCFEDtDiVEIIIYQQQuQAddwJIYQQQgiRA9RxJ4QQQgghRA5Qx50QQgghhBA5QB13QgghhBBC5MD/AbTK3ghaZq1EAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 1008x576 with 2 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "sns.heatmap(pd.DataFrame(doc_term_matrix.todense(), columns=words), cmap='Blues')\n",
    "plt.gcf().set_size_inches(14, 8);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Using thresholds to reduce the number of tokens "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:17:59.276213Z",
     "start_time": "2020-06-20T17:17:58.857526Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2225, 12789)"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vectorizer = CountVectorizer(max_df=.2, min_df=3, stop_words='english')\n",
    "doc_term_matrix = vectorizer.fit_transform(docs.body)\n",
    "doc_term_matrix.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Use CountVectorizer with Lemmatization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "#### Building a custom `tokenizer` for Lemmatization with `spacy`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:17:59.721748Z",
     "start_time": "2020-06-20T17:17:59.279559Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
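    },
   "outputs": [],
   "source": [
    "# Note: the 'en' shortcut is only available in spaCy 2.x\n",
    "nlp = spacy.load('en')\n",
    "def tokenizer(doc):\n",
    "    # Return the lemma of each token, skipping punctuation and whitespace\n",
    "    return [w.lemma_ for w in nlp(doc)\n",
    "            if not (w.is_punct or w.is_space)]"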
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:13.455322Z",
     "start_time": "2020-06-20T17:17:59.723193Z"
    },
    "run_control": {
     "marked": false
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2225, 25665)"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vectorizer = CountVectorizer(tokenizer=tokenizer, binary=True)\n",
    "doc_term_matrix = vectorizer.fit_transform(docs.body)\n",
    "doc_term_matrix.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:13.473754Z",
     "start_time": "2020-06-20T17:19:13.456267Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "from      0.702022\n",
       "but       0.732135\n",
       "as        0.742022\n",
       "by        0.765843\n",
       "at        0.792809\n",
       "with      0.824719\n",
       "that      0.830562\n",
       "say       0.881798\n",
       "'s        0.896629\n",
       "on        0.906517\n",
       "for       0.930337\n",
       "have      0.972584\n",
       "in        0.990562\n",
       "and       0.991011\n",
       "of        0.991461\n",
       "a         0.992809\n",
       "-PRON-    0.993708\n",
       "to        0.995056\n",
       "be        0.998202\n",
       "the       1.000000\n",
       "dtype: float64"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
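   ],
   "source": [
    "lemmatized_words = vectorizer.get_feature_names()\n",
    "# binary=True, so each column sum is the number of documents that contain the lemma\n",
    "word_freq = doc_term_matrix.sum(axis=0)\n",
    "word_freq_1d = np.squeeze(np.asarray(word_freq))\n",
    "# divide by the number of documents to get each lemma's document share\n",
    "word_freq_1d = pd.Series(word_freq_1d, index=lemmatized_words).div(docs.shape[0])\n",
    "word_freq_1d.sort_values().tail(20)"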
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
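   },
   "source": [
    "Unlike verbs and common nouns, personal pronouns have no clear base form. Should the lemma of \"me\" be \"I\", or should we normalize person as well, giving \"it\" or maybe \"he\"? spaCy 2.x sidesteps the question by introducing a novel symbol, `-PRON-`, which it uses as the lemma for all personal pronouns. That is why `-PRON-` appears near the top of the list above: almost every article contains at least one pronoun. Later spaCy releases dropped this symbol, so the exact output depends on the installed version."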
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Document-Term Matrix with `TfidfVectorizer`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `TfidfTransformer` computes the tf-idf weights from a document-term matrix of token counts like the one produced by the `CountVectorizer`.\n",
    "\n",
    "The `TfidfVectorizer` performs both computations in a single step. It adds a few parameters to the `CountVectorizer` API that control idf smoothing, sublinear term-frequency scaling, and normalization (`smooth_idf`, `sublinear_tf`, `norm`)."
   ]
  },
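  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of this equivalence, the snippet below runs both routes on a made-up three-sentence corpus (`toy_docs` is a hypothetical example, not part of the news data) and checks that the resulting weights agree when both estimators keep their default settings:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.feature_extraction.text import (CountVectorizer,\n",
    "                                             TfidfTransformer,\n",
    "                                             TfidfVectorizer)\n",
    "\n",
    "# hypothetical toy corpus, used only for this illustration\n",
    "toy_docs = ['the cat sat', 'the dog sat down', 'a cat and a dog']\n",
    "\n",
    "# two steps: token counts first, then tf-idf weights on top of the counts\n",
    "counts = CountVectorizer().fit_transform(toy_docs)\n",
    "two_step = TfidfTransformer().fit_transform(counts)\n",
    "\n",
    "# one step: counts and weights in a single estimator\n",
    "one_step = TfidfVectorizer().fit_transform(toy_docs)\n",
    "\n",
    "print(np.allclose(two_step.toarray(), one_step.toarray()))  # True\n",
    "```\n",
    "\n",
    "The two-step route is mainly useful when the raw counts are needed as well, for example to inspect them before weighting."
   ]
  },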
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Key Parameters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `TfidfTransformer` builds on the `CountVectorizer` output; the `TfidfVectorizer` integrates both steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:13.479142Z",
     "start_time": "2020-06-20T17:19:13.474488Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Transform a count matrix to a normalized tf or tf-idf representation\n",
      "\n",
      "    Tf means term-frequency while tf-idf means term-frequency times inverse\n",
      "    document-frequency. This is a common term weighting scheme in information\n",
      "    retrieval, that has also found good use in document classification.\n",
      "\n",
      "    The goal of using tf-idf instead of the raw frequencies of occurrence of a\n",
      "    token in a given document is to scale down the impact of tokens that occur\n",
      "    very frequently in a given corpus and that are hence empirically less\n",
      "    informative than features that occur in a small fraction of the training\n",
      "    corpus.\n",
      "\n",
      "    The formula that is used to compute the tf-idf for a term t of a document d\n",
      "    in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is\n",
      "    computed as idf(t) = log [ n / df(t) ] + 1 (if ``smooth_idf=False``), where\n",
      "    n is the total number of documents in the document set and df(t) is the\n",
      "    document frequency of t; the document frequency is the number of documents\n",
      "    in the document set that contain the term t. The effect of adding \"1\" to\n",
      "    the idf in the equation above is that terms with zero idf, i.e., terms\n",
      "    that occur in all documents in a training set, will not be entirely\n",
      "    ignored.\n",
      "    (Note that the idf formula above differs from the standard textbook\n",
      "    notation that defines the idf as\n",
      "    idf(t) = log [ n / (df(t) + 1) ]).\n",
      "\n",
      "    If ``smooth_idf=True`` (the default), the constant \"1\" is added to the\n",
      "    numerator and denominator of the idf as if an extra document was seen\n",
      "    containing every term in the collection exactly once, which prevents\n",
      "    zero divisions: idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1.\n",
      "\n",
      "    Furthermore, the formulas used to compute tf and idf depend\n",
      "    on parameter settings that correspond to the SMART notation used in IR\n",
      "    as follows:\n",
      "\n",
      "    Tf is \"n\" (natural) by default, \"l\" (logarithmic) when\n",
      "    ``sublinear_tf=True``.\n",
      "    Idf is \"t\" when use_idf is given, \"n\" (none) otherwise.\n",
      "    Normalization is \"c\" (cosine) when ``norm='l2'``, \"n\" (none)\n",
      "    when ``norm=None``.\n",
      "\n",
      "    Read more in the :ref:`User Guide <text_feature_extraction>`.\n",
      "\n",
      "    Parameters\n",
      "    ----------\n",
      "    norm : {'l1', 'l2'}, default='l2'\n",
      "        Each output row will have unit norm, either:\n",
      "        * 'l2': Sum of squares of vector elements is 1. The cosine\n",
      "        similarity between two vectors is their dot product when l2 norm has\n",
      "        been applied.\n",
      "        * 'l1': Sum of absolute values of vector elements is 1.\n",
      "        See :func:`preprocessing.normalize`\n",
      "\n",
      "    use_idf : bool, default=True\n",
      "        Enable inverse-document-frequency reweighting.\n",
      "\n",
      "    smooth_idf : bool, default=True\n",
      "        Smooth idf weights by adding one to document frequencies, as if an\n",
      "        extra document was seen containing every term in the collection\n",
      "        exactly once. Prevents zero divisions.\n",
      "\n",
      "    sublinear_tf : bool, default=False\n",
      "        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).\n",
      "\n",
      "    Attributes\n",
      "    ----------\n",
      "    idf_ : array of shape (n_features)\n",
      "        The inverse document frequency (IDF) vector; only defined\n",
      "        if  ``use_idf`` is True.\n",
      "\n",
      "        .. versionadded:: 0.20\n",
      "\n",
      "    Examples\n",
      "    --------\n",
      "    >>> from sklearn.feature_extraction.text import TfidfTransformer\n",
      "    >>> from sklearn.feature_extraction.text import CountVectorizer\n",
      "    >>> from sklearn.pipeline import Pipeline\n",
      "    >>> import numpy as np\n",
      "    >>> corpus = ['this is the first document',\n",
      "    ...           'this document is the second document',\n",
      "    ...           'and this is the third one',\n",
      "    ...           'is this the first document']\n",
      "    >>> vocabulary = ['this', 'document', 'first', 'is', 'second', 'the',\n",
      "    ...               'and', 'one']\n",
      "    >>> pipe = Pipeline([('count', CountVectorizer(vocabulary=vocabulary)),\n",
      "    ...                  ('tfid', TfidfTransformer())]).fit(corpus)\n",
      "    >>> pipe['count'].transform(corpus).toarray()\n",
      "    array([[1, 1, 1, 1, 0, 1, 0, 0],\n",
      "           [1, 2, 0, 1, 1, 1, 0, 0],\n",
      "           [1, 0, 0, 1, 0, 1, 1, 1],\n",
      "           [1, 1, 1, 1, 0, 1, 0, 0]])\n",
      "    >>> pipe['tfid'].idf_\n",
      "    array([1.        , 1.22314355, 1.51082562, 1.        , 1.91629073,\n",
      "           1.        , 1.91629073, 1.91629073])\n",
      "    >>> pipe.transform(corpus).shape\n",
      "    (4, 8)\n",
      "\n",
      "    References\n",
      "    ----------\n",
      "\n",
      "    .. [Yates2011] R. Baeza-Yates and B. Ribeiro-Neto (2011). Modern\n",
      "                   Information Retrieval. Addison Wesley, pp. 68-74.\n",
      "\n",
      "    .. [MRS2008] C.D. Manning, P. Raghavan and H. Schütze  (2008).\n",
      "                   Introduction to Information Retrieval. Cambridge University\n",
      "                   Press, pp. 118-120.\n",
      "    \n"
     ]
    }
   ],
   "source": [
    "print(TfidfTransformer().__doc__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How Term Frequency - Inverse Document Frequency works"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The tf-idf computation works as follows for a small text sample:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:13.487159Z",
     "start_time": "2020-06-20T17:19:13.480078Z"
    }
   },
   "outputs": [],
   "source": [
    "sample_docs = ['call you tomorrow', \n",
    "                'Call me a taxi', \n",
    "                'please call me... PLEASE!']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-11-21T14:05:08.089371Z",
     "start_time": "2018-11-21T14:05:08.069616Z"
    }
   },
   "source": [
    "#### Compute term frequency"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:13.496815Z",
     "start_time": "2020-06-20T17:19:13.488043Z"
    }
   },
   "outputs": [],
   "source": [
    "vectorizer = CountVectorizer()\n",
    "tf_dtm = vectorizer.fit_transform(sample_docs).todense()\n",
    "tokens = vectorizer.get_feature_names()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:13.511965Z",
     "start_time": "2020-06-20T17:19:13.498347Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   call  me  please  taxi  tomorrow  you\n",
      "0     1   0       0     0         1    1\n",
      "1     1   1       0     1         0    0\n",
      "2     1   1       2     0         0    0\n"
     ]
    }
   ],
   "source": [
    "term_frequency = pd.DataFrame(data=tf_dtm, \n",
    "                              columns=tokens)\n",
    "print(term_frequency)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Compute document frequency"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:13.523789Z",
     "start_time": "2020-06-20T17:19:13.512824Z"
    }
   },
   "outputs": [],
   "source": [
    "vectorizer = CountVectorizer(binary=True)\n",
    "df_dtm = vectorizer.fit_transform(sample_docs).todense().sum(axis=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:13.538418Z",
     "start_time": "2020-06-20T17:19:13.524699Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   call  me  please  taxi  tomorrow  you\n",
      "0     3   2       1     1         1    1\n"
     ]
    }
   ],
   "source": [
    "document_frequency = pd.DataFrame(data=df_dtm,\n",
    "                                  columns=tokens)\n",
    "print(document_frequency)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Compute TfIDF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:13.552926Z",
     "start_time": "2020-06-20T17:19:13.540654Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       call   me  please  taxi  tomorrow  you\n",
      "0  0.333333  0.0     0.0   0.0       1.0  1.0\n",
      "1  0.333333  0.5     0.0   1.0       0.0  0.0\n",
      "2  0.333333  0.5     2.0   0.0       0.0  0.0\n"
     ]
    }
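   ],
   "source": [
    "# simplified illustration: term frequency divided by document frequency,\n",
    "# without the logarithm, smoothing, or normalization that sklearn applies\n",
    "tfidf = pd.DataFrame(data=tf_dtm/df_dtm, columns=tokens)\n",
    "print(tfidf)"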
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The effect of smoothing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-11-21T14:07:22.382406Z",
     "start_time": "2018-11-21T14:07:22.373832Z"
    }
   },
   "source": [
    "The `TfidfVectorizer` supports smoothing and scaling of document and term frequencies:\n",
    "- `smooth_idf`: add one to the document frequency, as if an extra document contained every token in the vocabulary once, to prevent zero divisions\n",
    "- `sublinear_tf`: apply sublinear term-frequency scaling, i.e., replace tf with 1 + log(tf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:13.568104Z",
     "start_time": "2020-06-20T17:19:13.555181Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       call        me    please      taxi  tomorrow       you\n",
      "0  0.385372  0.000000  0.000000  0.000000  0.652491  0.652491\n",
      "1  0.425441  0.547832  0.000000  0.720333  0.000000  0.000000\n",
      "2  0.266075  0.342620  0.901008  0.000000  0.000000  0.000000\n"
     ]
    }
   ],
   "source": [
    "vect = TfidfVectorizer(smooth_idf=True, \n",
    "                       norm='l2',            # squared weights sum to 1 by document\n",
    "                       sublinear_tf=False,   # if True, use 1+log(tf)\n",
    "                       binary=False)\n",
    "print(pd.DataFrame(vect.fit_transform(sample_docs).todense(), \n",
    "             columns=vect.get_feature_names()))"
   ]
  },
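  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The weights printed above can be reproduced by hand, which makes the role of each setting explicit. The sketch below hard-codes the term counts of the three sample documents and applies the smoothed idf formula from the `TfidfTransformer` docstring, idf(t) = log[(1 + n) / (1 + df(t))] + 1, followed by l2 normalization of each row. The variable names are illustrative only:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# term counts per document; columns: call, me, please, taxi, tomorrow, you\n",
    "tf = np.array([[1, 0, 0, 0, 1, 1],\n",
    "               [1, 1, 0, 1, 0, 0],\n",
    "               [1, 1, 2, 0, 0, 0]], dtype=float)\n",
    "\n",
    "n_docs = tf.shape[0]\n",
    "df = (tf > 0).sum(axis=0)                   # number of documents containing each token\n",
    "idf = np.log((1 + n_docs) / (1 + df)) + 1    # smoothed idf, as with smooth_idf=True\n",
    "weights = tf * idf                           # raw tf-idf with natural (non-sublinear) tf\n",
    "\n",
    "# l2 normalization: the squared weights of every document sum to 1\n",
    "tfidf_manual = weights / np.linalg.norm(weights, axis=1, keepdims=True)\n",
    "print(tfidf_manual.round(6))\n",
    "```\n",
    "\n",
    "Rounded to six decimals, the first row is 0.385372 for `call` and 0.652491 for `tomorrow` and `you`, matching the `TfidfVectorizer` output. `call` appears in every document, so its idf is log(1) + 1 = 1, the smallest possible value under this formula."
   ]
  },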
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-11-21T14:10:41.311223Z",
     "start_time": "2018-11-21T14:10:41.298041Z"
    }
   },
   "source": [
    "### TfIDF with news articles"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because they assign meaningful token weights, tf-idf vectors are also used to summarize text data. For example, Reddit's autotldr bot is based on a similar algorithm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:13.971058Z",
     "start_time": "2020-06-20T17:19:13.569533Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2225, 28980)"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tfidf = TfidfVectorizer(stop_words='english')\n",
    "dtm_tfidf = tfidf.fit_transform(docs.body)\n",
    "tokens = tfidf.get_feature_names()\n",
    "dtm_tfidf.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:13.979715Z",
     "start_time": "2020-06-20T17:19:13.971962Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "token_freq = (pd.DataFrame({'tfidf': dtm_tfidf.sum(axis=0).A1,\n",
    "                            'token': tokens})\n",
    "              .sort_values('tfidf', ascending=False))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:13.987241Z",
     "start_time": "2020-06-20T17:19:13.980597Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>tfidf</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>token</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>said</th>\n",
       "      <td>87.251494</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mr</th>\n",
       "      <td>58.220783</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>year</th>\n",
       "      <td>41.982178</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>people</th>\n",
       "      <td>37.303707</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>new</th>\n",
       "      <td>34.197388</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>film</th>\n",
       "      <td>29.728250</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>government</th>\n",
       "      <td>28.792651</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>world</th>\n",
       "      <td>27.031199</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>time</th>\n",
       "      <td>26.358319</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>best</th>\n",
       "      <td>26.304266</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>baked</th>\n",
       "      <td>0.014186</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>pavlovian</th>\n",
       "      <td>0.014186</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>buzzcocks</th>\n",
       "      <td>0.014186</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>sisterhood</th>\n",
       "      <td>0.014186</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>siouxsie</th>\n",
       "      <td>0.014186</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>sioux</th>\n",
       "      <td>0.014186</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>bane</th>\n",
       "      <td>0.014186</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>biassed</th>\n",
       "      <td>0.014186</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>duetted</th>\n",
       "      <td>0.014186</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>speechless</th>\n",
       "      <td>0.014186</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                tfidf\n",
       "token                \n",
       "said        87.251494\n",
       "mr          58.220783\n",
       "year        41.982178\n",
       "people      37.303707\n",
       "new         34.197388\n",
       "film        29.728250\n",
       "government  28.792651\n",
       "world       27.031199\n",
       "time        26.358319\n",
       "best        26.304266\n",
       "baked        0.014186\n",
       "pavlovian    0.014186\n",
       "buzzcocks    0.014186\n",
       "sisterhood   0.014186\n",
       "siouxsie     0.014186\n",
       "sioux        0.014186\n",
       "bane         0.014186\n",
       "biassed      0.014186\n",
       "duetted      0.014186\n",
       "speechless   0.014186"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.concat([token_freq.head(10), token_freq.tail(10)]).set_index('token')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Summarizing news articles using TfIDF weights"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Select random article"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:13.995022Z",
     "start_time": "2020-06-20T17:19:13.988041Z"
    }
   },
   "outputs": [],
   "source": [
    "article = docs.sample(1).squeeze()\n",
    "article_id = article.name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:14.003298Z",
     "start_time": "2020-06-20T17:19:13.995869Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Topic:\tBusiness\n",
      "\n",
      "France Telecom gets Orange boost\n",
      "\n",
      "Strong growth in subscriptions to mobile phone network Orange has helped boost profits at owner France Telecom.  Orange added more than five million new customers in 2004, leading to a 10% increase in its revenues. Increased take-up of broadband telecoms services also boosted France Telecom's profits, which showed a 5.5% rise to 18.3bn euros ($23.4bn; £12.5bn). France Telecom is to spend 578m euros on buying out minority shareholders in data services provider Equant.  France Telecom, one of the world's largest telecoms and internet service providers, saw its full-year sales rise 2.2% to 47.2bn euros in 2004.  Orange enjoyed strong growth outside France and the United Kingdom - its core markets - swelling its subscriber base to 5.4 million. France Telecom's broadband customers also increased, rising to 5.1 million across Europe by the end of the year. The firm said it had met its main strategic objectives of growing its individual businesses and further reducing its large debt. An ill-fated expansion drive in the late 1990s saw France Telecom's debt soar to 72bn euros by 2002. However, this has now been reduced to 43.9bn euros. \"Our results for 2004 allow us to improve our financial structure while focusing on the innovation that drives our strategy,\" said chief executive Thierry Breton.  Looking ahead, the company forecast like-for-like sales growth of between 3% and 5% over the next three years. France Telecom is consolidating its interest in Equant, which provides telecoms and data services to businesses. Subject to approval by shareholders of the two firms, it will buy the shares in Equant it does not already own. France Telecom said it would fund the deal by selling an 8% stake in telephone directory company PagesJaunes.\n"
     ]
    }
   ],
   "source": [
    "print(f'Topic:\\t{article.topic.capitalize()}\\n\\n{article.heading}\\n')\n",
    "print(article.body.strip())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Select most relevant tokens by tfidf value"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:14.017696Z",
     "start_time": "2020-06-20T17:19:14.004071Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "telecom         0.540529\n",
       "france          0.341326\n",
       "equant          0.261060\n",
       "euros           0.244469\n",
       "orange          0.186060\n",
       "telecoms        0.160378\n",
       "services        0.108252\n",
       "growth          0.106366\n",
       "shareholders    0.102073\n",
       "businesses      0.097149\n",
       "dtype: float64"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "article_tfidf = dtm_tfidf[article_id].todense().A1\n",
    "article_tokens = pd.Series(article_tfidf, index=tokens)\n",
    "article_tokens.sort_values(ascending=False).head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Compare to random selection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:14.024287Z",
     "start_time": "2020-06-20T17:19:14.018468Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['our',\n",
       " 'in',\n",
       " 'at',\n",
       " 'Breton.',\n",
       " 'and',\n",
       " 'the',\n",
       " 'improve',\n",
       " 'to',\n",
       " 'subscriber',\n",
       " 'in']"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.Series(article.body.split()).sample(10).tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Create Train & Test Sets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Stratified `train_test_split`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:14.034638Z",
     "start_time": "2020-06-20T17:19:14.025176Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "train_docs, test_docs = train_test_split(docs, \n",
    "                                         stratify=docs.topic, \n",
    "                                         test_size=50, \n",
    "                                         random_state=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:14.042605Z",
     "start_time": "2020-06-20T17:19:14.035501Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((2175, 3), (50, 3))"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_docs.shape, test_docs.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:14.053701Z",
     "start_time": "2020-06-20T17:19:14.043453Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sport            12\n",
       "business         11\n",
       "entertainment     9\n",
       "politics          9\n",
       "tech              9\n",
       "Name: topic, dtype: int64"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.Series(test_docs.topic).value_counts()"
   ]
  },
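  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `stratify` argument keeps the class shares of the test set close to those of the full data set. A small sketch with a made-up frame (`toy` and its 60/30/10 class mix are assumptions for illustration, not the news data) shows the effect:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# hypothetical data set: 100 documents in three classes with shares 60/30/10\n",
    "toy = pd.DataFrame({'text': [f'doc {i}' for i in range(100)],\n",
    "                    'topic': ['a'] * 60 + ['b'] * 30 + ['c'] * 10})\n",
    "\n",
    "toy_train, toy_test = train_test_split(toy,\n",
    "                                       stratify=toy.topic,  # preserve class shares\n",
    "                                       test_size=20,        # absolute number of test rows\n",
    "                                       random_state=0)\n",
    "\n",
    "print(toy_test.topic.value_counts())\n",
    "```\n",
    "\n",
    "With these shares, the 20 test rows should split into 12, 6, and 2 documents for classes `a`, `b`, and `c`. Without `stratify`, a rare class such as `c` could easily be missing from a small test set."
   ]
  },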
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Vectorize train & test sets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:14.533675Z",
     "start_time": "2020-06-20T17:19:14.054889Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<2175x2000 sparse matrix of type '<class 'numpy.int64'>'\n",
       "\twith 178765 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vectorizer = CountVectorizer(max_df=.2, \n",
    "                             min_df=3, \n",
    "                             stop_words='english', \n",
    "                             max_features=2000)\n",
    "\n",
    "train_dtm = vectorizer.fit_transform(train_docs.body)\n",
    "words = vectorizer.get_feature_names()\n",
    "train_dtm"
   ]
  },
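  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The vocabulary is learned from the training documents only; held-out documents are mapped onto that fixed vocabulary with `transform`, so tokens the vectorizer has not seen are dropped. The following sketch illustrates this with two made-up training sentences and one test sentence (all hypothetical):\n",
    "\n",
    "```python\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "# hypothetical mini corpus, used only for this illustration\n",
    "train_texts = ['profits rise at telecom firm', 'telecom firm cuts debt']\n",
    "test_texts = ['debt worries hit airline profits']\n",
    "\n",
    "vec = CountVectorizer()\n",
    "vec.fit(train_texts)                      # vocabulary comes from the training texts only\n",
    "test_counts = vec.transform(test_texts)   # unseen tokens are ignored\n",
    "\n",
    "print(sorted(vec.vocabulary_))\n",
    "# ['at', 'cuts', 'debt', 'firm', 'profits', 'rise', 'telecom']\n",
    "print(test_counts.toarray())\n",
    "# [[0 0 1 0 1 0 0]]\n",
    "```\n",
    "\n",
    "Only `debt` and `profits` receive counts; `worries`, `hit`, and `airline` do not occur in the training vocabulary. Calling `fit_transform` on the test set instead would build a different vocabulary and make the two matrices incompatible."
   ]
  },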
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:19:14.547234Z",
     "start_time": "2020-06-20T17:19:14.534611Z"
    },
    "scrolled": true,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<50x2000 sparse matrix of type '<class 'numpy.int64'>'\n",
       "\twith 4043 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_dtm = vectorizer.transform(test_docs.body)\n",
    "test_dtm"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  },
  "name": "_merged",
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": true,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "379.503px",
    "left": "24px",
    "right": "1064px",
    "top": "66.3352px",
    "width": "343.575px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
